Best scan options for PDF

geohei · Post by **geohei** » 2014-06-01T02:00:33-07:00

Hi.

I need to convert the following PDF into TIF.
https://www.dropbox.com/s/6yo378t1mu4j7 ... 36.idp.pdf
Later on, it should be parsed using tesseract (OCR software).

tesseract produces quite a lot of errors and false character recognition. Its available options are also limited.

The difficulty with this particular PDF is that it uses user-defined fonts.

Code: Select all

A
    Type: Type 3
    Encoding: Custom
    Actual Font: A
    Actual Font Type: Type 3

Which convert options would be best for this PDF, so that tesseract gives the best possible results?

So far I have used the following, but the results are not that good.

Code: Select all

$ convert -monochrome -density 600 in.pdf out.tif

Even increasing density didn't really help.
https://www.dropbox.com/s/uxwd4k6pb1orz69/tmp.tif
There are white dots inside the letters and lots of jagged steps along the edges. Apparently too much for tesseract.
https://www.dropbox.com/s/c35hvnp07w8e9cd/tesseract.txt

Many thanks,

Post by **snibgo** » 2014-06-01T07:54:10-07:00

You may find that supersampling helps, without making it monochrome, eg:

Code: Select all

convert -density 2400 1400769600930.HEI-dmz2-prd-crewlink.2336.idp.pdf -resize 25% x.png

geohei · Post by **geohei** » 2014-06-01T09:40:42-07:00

Uuuhhh... this will eat up a lot of resources (not monochrome and density 2400). I'll give it a try.

What about some fancy image processing features like -adaptive-blur, ... ?
This was more the direction I was thinking about.

Post by **snibgo** » 2014-06-01T10:21:47-07:00

I use "-density 2400" but then resize to 25%. The resulting file will have the same number of pixels as at density 600 but, of course, it takes longer to make. You might then "-trim" to reduce the pixels.

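The arithmetic behind this can be checked with a small shell sketch. The function names `effective_dpi` and `pixels_across`, and the 8.27 inch (A4) page width, are assumptions for illustration and are not taken from the PDF in question.

```shell
#!/bin/sh
# Effective resolution after rendering at a given density and then resizing.
effective_dpi() {
    # $1 = render density in dpi, $2 = resize percentage
    echo $(( $1 * $2 / 100 ))
}

# Pixels across a page at a given resolution.
pixels_across() {
    # $1 = dpi, $2 = page width in 1/100 inch (827 = 8.27 in, an assumed A4 width)
    echo $(( $1 * $2 / 100 ))
}

effective_dpi 2400 25     # supersampled render, then -resize 25%
effective_dpi 1200 50     # cheaper variant, same effective resolution
pixels_across 2400 827    # width of the intermediate render
pixels_across 600 827     # width of the final image
```

The intermediate render is four times wider and four times taller than the final image, which is where the extra time and memory go.
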
I don't use tesseract or any OCR, so can't test.

With my human eyes, characters with anti-aliasing are much easier to read than "-monochrome", which is black/white with no gray. I don't know if the same is true of OCR.

Adobe Reader shows your PDF with no anti-aliasing, ie the characters have jagged edges. I doubt if "-adaptive-blur" would help, but an ordinary small blur, eg "-blur 0x0.5" might.

geohei · Post by **geohei** » 2014-06-03T09:37:34-07:00

If I don't use -monochrome, tesseract gives the following error:

Code: Select all

Error in pixReadFromTiffStream: can't handle bpp > 32
Error in pixReadStreamTiff: pix not read
Error in pixReadStream: tiff: no pix returned
Error in pixRead: pix not read
Error in pixGetInputFormat: pix not defined
Reading tmp/tmp.tif as a list of filenames...
Error in fopenReadStream: file not found
Error in pixRead: image file not found
Image file II* cannot be read!
Error during processing.

Hence, I can't check how the OCR reacts to non-monochrome input. So ... let's stick to "-monochrome" for the time being.
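One possible cause, offered as an assumption rather than a diagnosis: the PDF may be rendered at 16 bits per channel with an alpha channel, giving 4 x 16 = 64 bits per pixel, which is above the 32 bpp limit named in the error. The sketch below prints a convert command that keeps the gray levels but writes 8-bit grayscale without alpha. The helper names `bpp` and `gray_cmd` and the file names are placeholders.

```shell
#!/bin/sh
# Bits per pixel = number of channels x bits per channel.
bpp() { echo $(( $1 * $2 )); }

bpp 4 16   # RGBA at 16 bits per channel
bpp 1 8    # one gray channel at 8 bits

# Print a convert command that drops alpha and writes 8-bit gray.
# $1 = input PDF, $2 = output TIFF (placeholder names in the call below).
gray_cmd() {
    echo "convert -density 600 -background white $1 -alpha off -colorspace Gray -depth 8 $2"
}

gray_cmd in.pdf out.tif
# To run it for real (requires ImageMagick): eval "$(gray_cmd in.pdf out.tif)"
```

Whether tesseract then reads the file, and whether gray input actually improves recognition, would still need testing.
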

How exactly did you get to "-blur 0x0.5"? I'd like to understand how you figured out these values.
Why not use a simple blur ("-blur a")?

Regarding performance, "density 1200 -resize 50%" is still acceptable, but "density 2400 -resize 25%" turns into a 5-minute exercise on my hardware. Unacceptable for my purpose. "density 1200 -resize 50%" isn't working very well for OCR either.

While testing different "-blur axb" options, I found that "-blur 1x2" worked best, but it is still far from optimal. Characters touch each other and are misrecognized, although the jagged steps on slopes are reduced.

Isn't there any other convert option which might work better than that?

BTW ... is the order in which options are placed relevant?

Post by **snibgo** » 2014-06-03T09:55:56-07:00

Yes, the order is relevant. Processes are executed in the order you give them.
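As a rough illustration of the ordering: settings such as "-density" affect how the PDF is read, so they go before the input file name, while operators such as "-blur" act on images that are already loaded, so they go after it. The helper `build_cmd` below is hypothetical and only prints a command line laid out in that order; the file names are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: assemble a convert command in processing order:
# read settings, input file, image operators, output file.
build_cmd() {
    # $1 = settings used while reading, $2 = input, $3 = operators, $4 = output
    echo "convert $1 $2 $3 $4"
}

build_cmd "-density 600" in.pdf "-blur 0x0.5 -monochrome" out.tif
```
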

Code: Select all

convert -density 600 -background White 1400769600930.HEI-dmz2-prd-crewlink.2336.idp.pdf -alpha off -blur 1x65000 -threshold 50% -monochrome x.tiff

The result looks quite good, but I'm afraid you will need to use trial and error. I have no experience with OCR.
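Trial and error of this kind can be scripted. The sketch below only prints one convert command and one tesseract command per blur sigma; the function name `sweep_cmds`, the sigma values and the file names are arbitrary placeholders.

```shell
#!/bin/sh
# Print a convert + tesseract command pair for each candidate blur sigma.
# Pipe the output to sh to execute it (requires ImageMagick and tesseract).
sweep_cmds() {
    # $1 = input PDF; remaining arguments = blur sigmas to try
    pdf=$1
    shift
    for sigma in "$@"; do
        out="out_${sigma}"
        echo "convert -density 600 $pdf -blur 0x$sigma -monochrome $out.tif"
        echo "tesseract $out.tif $out"
    done
}

sweep_cmds in.pdf 0.5 1 2
```

Comparing the resulting text files against a hand-checked page would show which setting misreads the fewest characters.
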

geohei · Post by **geohei** » 2014-06-04T04:04:58-07:00

I have now spent quite some time experimenting.

Your suggested options are not as good as this:

Code: Select all

convert -density 600 -blur 1x2 -monochrome in.pdf out.tif

... seems to give the best results, but tesseract still fails when it hits partially overlapping characters.

I don't know the meaning and effect of all convert options (there are MANY as you know ...), but there must be something which still permits better results.

Which OS are you using? Windows? There is a tesseract version for windows.
http://code.google.com/p/tesseract-ocr/ ... e&can=2&q=

Legacy ImageMagick Discussions Archive
