Solution 1: Obtain a version of the document that does not contain renderable (editable) text. This message appears if the PDF document already contains editable text. Arrange the files in the Files to Combine section in the order in which you want them to appear in the new PDF. OCR recognizes each character and turns it into editable text. During the OCR process, Acrobat compares the image shape and line thickness to the fonts installed on the system.
Grant Sheridan Robertson's personal blog. Ideas, thoughts, and various things I would like to share with the world. Notice, I am not saying it is "The" solution. Using this technique, it is possible to obtain a searchable and text-selectable document while preserving the original image of the scanned document, if desired.
It also makes for some extremely large files. Fortunately, we don't have to leave our files in this format. It is merely used as a transitional format; the conversion to it strips out the bothersome "renderable text." This could take quite some time depending on how much "rendered text" i. Text that is actually only an image should convert rather quickly because this process seems to simply move the image portions of the documents straight over without any conversion or alteration whatsoever.
Though I am not positive, the little bit of poking around I did in the document leads me to speculate that the .XPS printer driver converts each and every character in the document into a vector graphic, similar to an Adobe PostScript file. As you can imagine, this makes for an incredibly large file (see the table below) and it takes a really long time.
I would suggest you start this process and then go off to a long lunch or meeting. If you have a separate computer on which you can run these processes, so much the better.
Now this step is really going to take a long time, perhaps hours. If you have a large document with lots of "rendered text," I recommend that you start the process before going to bed or before leaving the office for the night.
In addition, once you have started this process, it will look as if your computer isn't doing anything at all for almost the entire time. PDF document to show. I do have to admit that this conversion does seem to produce slightly blurrier images for scanned documents. It appears that either Acrobat or the XPS driver does a little bit of antialiasing of the jagged edges. Which you choose depends on the original document and the intended use for the final document.
Most academics will be dealing with scanned documents, where the "document" is actually just a series of images of pages stored in the .PDF file. Now, said academic may want to preserve the original image of the document so that it can be scrutinized, or have snapshots grabbed from it, in the future. This produces a pretty large file. However, if the file was really just a series of images to begin with, then the resulting file may not be much larger than the original.
On the other hand, our imaginary academic may want to produce the smallest possible file size, or may have hopes of producing a file that is easier to read than the scanned original.
It also sometimes completely gives up and just places a small image of the word - or just a couple of letters - in the spot where those letters should have gone. It is acceptably readable but it looks weird and those words or letters aren't selectable.
The plain "Searchable Image" output style is a decent middle of the road option, but it does modify the look of the page images because they are compressed.
You should experiment to make sure you can tolerate the results. Some of the documents that cause the "renderable text" error look as if they were generated by a computer ("born digital," as some are saying these days), but either some of the text is not selectable or it is selectable but the copied text is gibberish. Many people suspect this is meant to prevent people from copying any of the document for use elsewhere.
It also makes the document practically useless for any academic or business purpose. For these kinds of documents, the .XPS file can be ginormous: ten to twenty times the size of the original.
The "Searchable Image (exact)" output style does produce the best looking result (the final document looks exactly like the original), but the final .PDF file size is only slightly less ginormous than the .XPS file. This is because all the vector images of all the individual characters in the document are retained when using this OCR output style. While that isn't a problem for a mostly-image scanned document because there is a relatively small amount of "rendered text," it is a nightmare for mostly-text documents because of the vast quantity of individual vectors they contain. So, only use the "Searchable Image (exact)" output style if the document also contains images which you absolutely must retain in their original quality.
If the most important images are on separate pages from the text, then one could selectively OCR only the pages with text using the ClearScan output style. I do not recommend the plain "Searchable Image" output style because it produces really poor quality character renderings. The result is selectable, but it is much more difficult to read than documents produced using either the "Searchable Image (exact)" or the "ClearScan" output style.
The ClearScan output style results in very nice looking text as well as files that are usually less than twice the size of the original, sometimes even smaller than the original. However, the images within the document may not look as good as the originals. Again, some selective OCRing may produce a better result, but that requires more manual labor, which we are trying to avoid. The chart below shows the resulting file sizes.
If there is nothing in a cell, that means I didn't think it was worth trying that conversion. As you can see, the results vary dramatically. Note, however, that pages with the most text produced the greatest increase in size when printing to the. I haven't performed similar tests on mostly-image documents at this time. Perhaps I will do so later.
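If you want to make the same kind of size comparison for your own files, a few lines of Python will do it. This is only a minimal sketch: the function names `size_ratios` and `format_report` are my own inventions for illustration, and the script simply reads file sizes from disk; it does not talk to Acrobat or the XPS driver in any way.

```python
import sys
from pathlib import Path


def size_ratios(sizes_in_bytes, baseline="original"):
    """Express every file size as a multiple of the baseline file's size."""
    base = sizes_in_bytes[baseline]
    if base <= 0:
        raise ValueError("baseline size must be positive")
    return {name: size / base for name, size in sizes_in_bytes.items()}


def format_report(sizes_in_bytes, baseline="original"):
    """Build a plain-text table: name, size in MB, ratio to the baseline."""
    ratios = size_ratios(sizes_in_bytes, baseline)
    lines = []
    for name, size in sizes_in_bytes.items():
        lines.append(f"{name:<28}{size / 1_000_000:>10.2f} MB{ratios[name]:>8.1f}x")
    return "\n".join(lines)


if __name__ == "__main__" and len(sys.argv) > 1:
    # The first path on the command line is treated as the original document;
    # the rest are the transitional and final files you want to compare to it.
    paths = [Path(p) for p in sys.argv[1:]]
    sizes = {p.name: p.stat().st_size for p in paths}
    print(format_report(sizes, baseline=paths[0].name))
```

Run it with the original file first, followed by the transitional .XPS file and whichever OCR results you produced, and it prints one row per file with its size relative to the original.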
Such is the luxury of doing all this only for my own edification and sharing the information completely free, without any ads even. Hopefully, this article will be a big help for: (A) all those students out there trying to OCR all those papers they have collected in their research so they can pull quotes out of them without retyping everything, as well as (B) those archivists out there who are trying to make the documents in their collections searchable.
Though I have not done so, it should also be possible to write some kind of script that would completely automate this process for batch-processing lots of files at the same time.
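As a starting point for such a script, here is a rough Python sketch of the bookkeeping only. I have not verified any command-line or scripting interface for Acrobat or the XPS printer driver, so the three steps (print to .XPS, convert back to .PDF, run OCR) are passed in as functions that you would have to supply yourself. The names `Job`, `plan_jobs`, `run_jobs`, and the `_ocr` file suffix are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Job:
    source: Path        # original .PDF that refuses OCR
    transitional: Path  # intermediate .XPS file
    result: Path        # final .PDF that gets OCRed


def plan_jobs(sources: Iterable[Path], work_dir: Path) -> List[Job]:
    """Pair every .PDF with the transitional and final file names it will use."""
    jobs = []
    for src in sorted(Path(s) for s in sources):
        if src.suffix.lower() != ".pdf":
            continue  # ignore anything that is not a PDF
        jobs.append(Job(src,
                        work_dir / (src.stem + ".xps"),
                        work_dir / (src.stem + "_ocr.pdf")))
    return jobs


def run_jobs(jobs: Iterable[Job],
             print_to_xps: Callable[[Path, Path], None],
             xps_to_pdf: Callable[[Path, Path], None],
             run_ocr: Callable[[Path], None]) -> List[Path]:
    """Run the three steps in order for each job.

    The callables are placeholders (assumptions): plug in whatever actually
    drives the conversion and OCR on your machine.
    """
    finished = []
    for job in jobs:
        print_to_xps(job.source, job.transitional)
        xps_to_pdf(job.transitional, job.result)
        run_ocr(job.result)
        finished.append(job.result)
    return finished
```

Keeping the conversion steps as plug-in functions means the planning part can be tested with dummy functions before you let it loose on a folder of large files overnight.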
If this helps you, please let me know. If you have any questions or suggestions, please don't hesitate to contact me.
How to remove Renderable Text from. Thanks, Grant, for taking the time to document the process in such detail. So sorry to report that despite diligently following the steps, the " It cannot be captured". I have tried specifying different output styles and starting from scratch (deleting the transitional files) a number of times. I did the latter because I noticed that after right-clicking and converting the .XPS file, doing it again (accidentally or deliberately) did nothing, even if I deleted the .PDF created the first time.
A reboot was required before a retry of the .XPS to .PDF conversion worked. By the wording of the message you received, it seems that the "offending" graphic is something that is drawn with vector graphics rather than a raster image. After you have found the graphic(s) that block OCR, you could open the original document and try to copy and paste the graphics back into your OCRed file. You have to have the Touchup Object Tool selected in both documents to complete the copy and paste.
I know this is incredibly tedious, but I can think of no other way to accomplish this and still preserve the quality of the "offending" graphics. Of course there is still always the convert to TIFF and back method but that will rasterize and pixelate your graphics.
I hope this helps. I'm not the original "anonymous" but I didn't have success either. The document I converted back to PDF still had renderable text in it (although not as much as it did originally), and after OCR was completed, the remaining text was so blurry it could not be read. Anonymous 2: You need to tell me more about your document and what you did. Was it a scanned document or "born digital"? I have updated the instructions, including the section on converting back to .PDF format.
I had a similar problem while recognizing an page document. This did it for me: in Adobe X, click Tools, click Crop, use the mouse to select the whole page, click Tools, click OCR, and voila!
That is an incredible tip, Jonny. I have had similar experiences with other software "back in the day" but not recently. Sometimes software just doesn't handle certain patterns of data sequences within their own data. It will read the file and not raise any red flags. But once it tries to do a certain function then it chokes on just a few bytes that are in a sequence it doesn't expect. Rather than pop up a dialog and ask what you want to do, the software just chokes.
I had thought Adobe had learned better than this by now. Hi Grant, I had this problem as well. I must say it was a small 6 page document, so maybe it worked that way. Anyway, just to say congratulations on the article, and please keep doing this useful work.
From Portugal. Thank you for the article. It inspired me to use Automator on my Mac to basically create the workflow you described. It seems to have helped the problem so far! Just thought I'd give the Automator shout-out for Mac users who may have stumbled here via Google looking for a solution like me!
Fix the OCR error Could Not Perform Recognition in Acrobat
This message will sometimes occur when trying to make a scanned paper PDF file text searchable (also known as adding OCR to a document). Depending on the version of Acrobat you have, the message may read something like: The way this text is encoded into the page can cause Acrobat to disallow additional searchable text (OCR text). This message can certainly be annoying, and it can also be significant, as it can limit your ability to run searches.
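To get a rough idea of whether a file already has text encoded into its pages before you try OCR, you can look for PDF text-showing operators (`BT` ... `Tj`/`TJ`) in the page content streams. The Python sketch below is a crude heuristic of my own, not the check Acrobat itself performs. It assumes streams are either uncompressed or Flate (zlib) compressed; it ignores other filters, object streams, and encryption; and a scanned file that has already been OCRed will also report text, because its hidden text layer uses the same operators.

```python
import re
import zlib

# Raw bytes between "stream" and "endstream" for each stream object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A begin-text operator followed, somewhere later, by a show-text operator.
TEXT_RE = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b", re.DOTALL)


def _decode(raw):
    """Inflate Flate-compressed stream data; fall back to the raw bytes."""
    try:
        # decompressobj tolerates the trailing newline before "endstream".
        return zlib.decompressobj().decompress(raw)
    except zlib.error:
        return raw


def has_text_operators(pdf_bytes):
    """Return True if any stream appears to contain text-showing operators."""
    return any(TEXT_RE.search(_decode(m.group(1)))
               for m in STREAM_RE.finditer(pdf_bytes))
```

To try it on a file, read the file in binary mode and pass the bytes in, for example `has_text_operators(open("paper.pdf", "rb").read())`, where `paper.pdf` is a placeholder name. Treat a `True` result as a hint that the file may trigger the message, not as proof.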
Adobe Acrobat: “Renderable Text”