Help Improving Text Of Scanned Image 4 OCR
Posted: 2014-05-28T10:28:03-07:00
by lindylex
I have an image PDF page. I would like to convert the page to an image and extract the text.
This is the pdf page.
http://mo-de.net/d/out.pdf
This is what I have tried.
Convert the pdf page to an image.
Code: Select all
convert -density 200 -antialias -sharpen 0x3.0 -colorspace GRAY out.pdf t5.png
I use the following to clean up the gray at the bottom with solid white.
Code: Select all
convert -fuzz 30% -fill "#ffffff" -opaque "#f2f2f2" t5.png t6.png
Convert the image to text.
My second question: how can I pipe the two convert commands together?
My FAILED attempt.
Code: Select all
convert -density 200 -antialias -sharpen 0x3.0 -colorspace GRAY out.pdf - | convert -fuzz 30% -fill "#ffffff" -opaque "#f2f2f2" - - t5.png
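For anyone landing here with the same piping question, here is a minimal sketch of one way to chain the two steps. It assumes ImageMagick 6's convert is on the PATH. The function name pdf_to_clean_png and the IM_CONVERT override are made up for illustration. The first command writes its result to stdout in MIFF format (miff:-); the second reads stdin (-) and writes to one named output file. The failed attempt names - twice in the second command, so stdin is listed as an input twice.

```shell
# Sketch only: pdf_to_clean_png and IM_CONVERT are illustrative names,
# not part of ImageMagick. IM_CONVERT (hypothetical) lets you point at a
# different convert binary; it defaults to "convert".
pdf_to_clean_png() {
    in_pdf=$1
    out_png=$2
    im=${IM_CONVERT:-convert}
    # Stage 1: rasterize the PDF at 200 dpi, sharpen, convert to gray,
    #          and stream the result to stdout as MIFF.
    # Stage 2: read that stream from stdin, turn near-#f2f2f2 pixels
    #          white, and write the final PNG.
    "$im" -density 200 -antialias "$in_pdf" -sharpen 0x3.0 -colorspace GRAY miff:- |
        "$im" - -fuzz 30% -fill "#ffffff" -opaque "#f2f2f2" "$out_png"
}

# Usage (needs ImageMagick and the PDF in the current directory):
#   pdf_to_clean_png out.pdf t5.png
```

The settings (-density, -antialias) are placed before the PDF is read and the operators after it, which is the order the ImageMagick documentation recommends. The same options can also be given to a single convert command, which avoids the pipe entirely.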
Re: Help Improving Text Of Scanned Image 4 OCR
Posted: 2014-05-28T11:04:28-07:00
by fmw42
If you are on Linux/Mac OS or Windows with Cygwin, see my script textcleaner at the link below. Otherwise, see the IM function -lat.
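For readers without the script, a rough sketch of the -lat (local adaptive threshold) route follows. The 25x25 window and +10% offset are starting guesses to tune, not values taken from this thread, and lat_clean and IM_CONVERT are hypothetical names used only for illustration.

```shell
# Sketch only: lat_clean and IM_CONVERT are illustrative names.
# The window size (25x25) and offset (+10%) are assumptions to tune per scan.
lat_clean() {
    im=${IM_CONVERT:-convert}
    # Negate so the text is bright, threshold each pixel against its local
    # 25x25 neighbourhood average plus 10%, then negate back so the result
    # is black text on a white background.
    "$im" "$1" -colorspace GRAY -negate -lat 25x25+10% -negate "$2"
}

# Usage (needs ImageMagick):
#   lat_clean t5.png t5_lat.png
```

A larger window tends to suit larger text, and a larger offset removes more background at the risk of thinning the strokes.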
Re: Help Improving Text Of Scanned Image 4 OCR
Posted: 2014-05-28T11:05:14-07:00
by lindylex
It is on Debian.
Re: Help Improving Text Of Scanned Image 4 OCR
Posted: 2014-05-28T11:06:37-07:00
by fmw42
That is Unix, correct? If so, my script can work on that OS.
Re: Help Improving Text Of Scanned Image 4 OCR
Posted: 2014-05-28T11:10:30-07:00
by fmw42
Before using the script, convert your PDF to a high-resolution raster image such as PNG. Use -density to get high resolution:
Code: Select all
convert -density XX image.pdf image.png
where XX is greater than 72 (the default density), such as 288 (which is 4x). If the resulting image is too big, then do
Code: Select all
convert -density XX image.pdf -resize YY image.png
where YY is 25% or larger when XX=288.
Or resize after using textcleaner.
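The relationship between XX and YY is simple arithmetic: rendering at XX dpi scales each dimension by XX/72, so resizing by 72/XX brings the image back to its default-density size. A small sketch, where resize_percent is a hypothetical helper name and the result is rounded down to a whole percent:

```shell
# Sketch only: resize_percent is an illustrative helper, not an ImageMagick tool.
# It prints the integer percentage that scales a DENSITY-dpi render back to
# the size it would have had at the default 72 dpi.
resize_percent() {
    echo $(( 72 * 100 / $1 ))
}

resize_percent 288   # prints 25
resize_percent 144   # prints 50
```

Any YY above that value keeps some of the extra resolution, which is usually what you want for OCR.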
Re: Help Improving Text Of Scanned Image 4 OCR
Posted: 2014-05-28T11:19:02-07:00
by lindylex
fmw42, thanks for sharing this. I appreciate your hard work.
I tried three of the commands on your site. This is the best one I have got so far. Any suggestions from looking at the PDF?
Code: Select all
./textcleaner -g -e stretch -f 25 -o 5 -s 1 out.pdf t9.png
Re: Help Improving Text Of Scanned Image 4 OCR
Posted: 2014-05-28T14:50:25-07:00
by fmw42
You did not use my suggestion of converting the PDF to PNG with -density before processing. Try this:
Code: Select all
convert -density 288 out.pdf out1.png
textcleaner -g -e stretch -f 50 -o 10 -s 1 out1.png out1_f50_o10.png
convert out1_f50_o10.png -resize 25% out1_f50_o10_r25.png
or
Code: Select all
convert -density 288 out.pdf miff:- |\
textcleaner -g -e stretch -f 50 -o 10 -s 1 - miff:- |\
convert - -resize 25% out1_f50_o10_r25.png
Adjust the 25% as desired for the final size. The density 288 makes out1.png about 4 times larger in each dimension (higher quality). Add any other arguments you want to textcleaner.
If you separate the two pages, you can use -deskew 40 to unrotate them so the lines are more even. If the pages are split before textcleaner, then use -u in textcleaner to do the unrotate.
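A minimal sketch of the split-then-deskew idea, assuming the scan holds two pages side by side and that a straight cut down the middle separates them. The names split_and_deskew and IM_CONVERT are hypothetical, and the 40% threshold follows the value the ImageMagick documentation suggests for -deskew.

```shell
# Sketch only: split_and_deskew and IM_CONVERT are illustrative names.
# Assumes a two-page spread that can be cut exactly in half.
split_and_deskew() {
    im=${IM_CONVERT:-convert}
    # Cut the image into 2 equal tiles across and 1 down; +repage drops the
    # crop offsets. Output files are PREFIX_0.png (left) and PREFIX_1.png (right).
    "$im" "$1" -crop 2x1@ +repage "$2_%d.png"
    # Straighten each half in place.
    for page in "$2_0.png" "$2_1.png"; do
        "$im" "$page" -deskew 40% +repage "$page"
    done
}

# Usage (needs ImageMagick):
#   split_and_deskew out1.png page
```

If the gutter is not in the centre of the scan, an explicit -crop geometry per page would be needed instead of the equal-tile split.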