
PDF is Lying to me!

Posted: 2014-07-22T13:40:21-07:00
by RobO
Hi,
I think this is really a question about PDFs. These have been produced in large numbers by a scanner in Singapore; I am in London. The odd thing is that the observed quality as seen by a PDF viewer is very high. I can see fingerprints on till receipts. But when I convert them to BMP using "convert" the quality is very low. 12pt text is unreadable.

I assume that convert should follow the quality of the source image by default. If I give it a "-density" qualifier all is good. So could the pdf contain header information with the wrong density?

The interesting thing is that we are using some software which identifies barcodes in the scans and converts the file to a new PDF with the barcode number as a name. Unfortunately, as it does so it loses all quality and, not surprisingly, fails to find the barcode. I think it is being fooled as well.

I need to fix the PDFs so that the barcode reader can process the images.

A five page example around 400kb.
C:\>identify HiQ.PDF
HiQ.PDF[0] PDF 595x842 595x842+0+0 16-bit sRGB 1.503MB 0.047u 0:00.045
HiQ.PDF[1] PDF 595x842 595x842+0+0 16-bit sRGB 1.503MB 0.031u 0:00.031
HiQ.PDF[2] PDF 595x842 595x842+0+0 16-bit sRGB 1.503MB 0.031u 0:00.031
HiQ.PDF[3] PDF 595x842 595x842+0+0 16-bit sRGB 1.503MB 0.016u 0:00.031
HiQ.PDF[4] PDF 595x842 595x842+0+0 16-bit sRGB 1.503MB 0.016u 0:00.031
HiQ.PDF[5] PDF 595x842 595x842+0+0 16-bit sRGB 1.503MB 0.000u 0:00.016

This is an A4 pdf with five pages. It is way more than 595x842 pixels. Looking at it, it could be a photo.

If I convert it to bit maps without specifying a density I get:
C:\>convert HiQ.PDF t.bmp
C:\>identify t-*.bmp
t-0.bmp BMP 595x842 595x842+0+0 8-bit sRGB 1.506MB 0.016u 0:00.016
t-1.bmp[1] BMP 595x842 595x842+0+0 8-bit sRGB 1.506MB 0.016u 0:00.015
t-2.bmp[2] BMP 595x842 595x842+0+0 8-bit sRGB 1.506MB 0.016u 0:00.016
t-3.bmp[3] BMP 595x842 595x842+0+0 8-bit sRGB 1.506MB 0.016u 0:00.016
t-4.bmp[4] BMP 595x842 595x842+0+0 8-bit sRGB 1.506MB 0.016u 0:00.015
t-5.bmp[5] BMP 595x842 595x842+0+0 8-bit sRGB 1.506MB 0.016u 0:00.016

Now I know this has converted to 8 bit so it should be slightly lower quality. But it's actually almost unreadable.

If I convert it specifying -density 400 I get lovely (huge) images.
C:\>identify t.tif
t.tif[0] TIFF 3308x4680 3308x4680+0+0 1-bit Bilevel Gray 102.6MB 0.016u 0:00.047
t.tif[1] TIFF 3308x4680 3308x4680+0+0 1-bit Bilevel Gray 102.6MB 0.047u 0:00.047
t.tif[2] TIFF 3308x4680 3308x4680+0+0 16-bit sRGB 102.6MB 0.031u 0:00.031
t.tif[3] TIFF 3308x4680 3308x4680+0+0 1-bit Bilevel Gray 102.6MB 0.031u 0:00.046
t.tif[4] TIFF 3308x4680 3308x4680+0+0 1-bit Bilevel Gray 102.6MB 0.031u 0:00.046
t.tif[5] TIFF 3308x4680 3308x4680+0+0 1-bit Bilevel Gray 102.6MB 0.031u 0:00.046

Unfortunately I have many thousands of these images. (OK, I've been putting off the problem...) Are there any shortcuts to correct the PDF headers?

I can fix them with
convert -density 400 HiQ.pdf fixed.pdf
But it will take a while. I just tried 6 PDFs each around 500kb. It took 20 minutes on a 16Gb 3.4Ghz Xeon!

Any pointers on optimising the command line, globbing file names and the like? Ideally convert should process each PDF and create a fixed_?.PDF.
I also have some monochrome TIFs from the same scanner. These are showing
C:\>identify SIN.TIF
SIN.TIF TIFF 1654x2339 1654x2339+0+0 1-bit Bilevel Gray 43.9KB 0.016u 0:00.014
Which is what I suspect the PDFs actually contain!

Thanks,
Rob.

Re: PDF is Lying to me!

Posted: 2014-07-22T14:48:41-07:00
by snibgo
I don't know why people put scanned images into PDFs. PDF is a useful format when making documents to be printed, and not good for much else. In particular, it is a pain when further processing is needed.

I think you have figured out the problem and solution: "-density N". The trick is to find the best N. Scanners often work in multiples of 150, so you might find that 300 or 450 is best. With luck, all your documents need the same setting.
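
For what it's worth, the numbers in the first post hint at a value for N. 595x842 is the A4 page size in points (1/72 inch), and convert rasterises a PDF at 72 dpi unless told otherwise, which is why the default output is 595x842 pixels. If the PDFs really do hold the same 1654x2339 bitmap as the TIFs (Rob only suspects this, so treat it as an assumption), the native resolution can be estimated by dividing pixels by inches. A rough Python sketch of that arithmetic:

```python
POINTS_PER_INCH = 72.0  # PDF page sizes are measured in points


def estimate_density(pixels_wide, pixels_high, points_wide, points_high):
    """Return (horizontal_dpi, vertical_dpi) for an image that fills the page."""
    dpi_x = pixels_wide / (points_wide / POINTS_PER_INCH)
    dpi_y = pixels_high / (points_high / POINTS_PER_INCH)
    return dpi_x, dpi_y


# Assumption: the PDF pages embed a 1654x2339 image, like SIN.TIF above.
dpi_x, dpi_y = estimate_density(1654, 2339, 595, 842)
print(round(dpi_x), round(dpi_y))  # prints: 200 200
```

If that assumption holds, "-density 200" would render close to the scanner's own resolution with a quarter of the pixels that 400 produces, so it is probably worth trying first.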

Re: PDF is Lying to me!

Posted: 2014-07-23T02:51:23-07:00
by RobO
Thanks for the encouragement. Will try different densities to see if I can speed the process up. Otherwise it will run for a whole weekend.
Re PDFs. One odd thing about these business scanners is that if you set them to TIF they produce one file per page, which is odd as TIFs certainly can have multiple pages. Now in this business process the page order is important: the barcode is on the first page of the invoice and there can be a whole day's invoices in one scan. So if you lose the order you end up with a mess. The scanner names the TIFs Date+pageN.TIF, which looks good except that it doesn't zero-pad the page number! Windows Explorer fudges this but our processing software doesn't. Hence the PDFs.
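
If the unpadded page numbers are the only thing scrambling the order, sorting the names numerically may be enough. A minimal Python sketch of a "natural" sort key follows; the file names are made up, since the scanner's exact pattern isn't shown here:

```python
import re


def natural_key(name):
    """Split a name into text and number chunks so 'page10' sorts after 'page2'."""
    parts = re.split(r"(\d+)", name)
    # re.split with a capture group alternates text, digits, text, ...
    # so the odd positions are always the digit runs.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


# Hypothetical names in the Date+pageN.TIF style described above.
names = ["20140722page10.TIF", "20140722page2.TIF", "20140722page1.TIF"]
print(sorted(names, key=natural_key))
# prints: ['20140722page1.TIF', '20140722page2.TIF', '20140722page10.TIF']
```

The same key could be used to rename the files with zero-padded numbers before handing them to software that sorts plain text.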

Any pointers on batch processing? I'd really like to create one pdf for each pdf I have to process.
So rather than:
convert *.PDF HUGE.pdf
I can:
convert *.PDF new_*.PDF
such that :
ABC.pdf -> new_ABC.pdf
ABD.pdf -> new_ABD.pdf
Or even place the new images in a different folder.

Re: PDF is Lying to me!

Posted: 2014-07-23T09:55:02-07:00
by fmw42
convert *.PDF new_*.PDF
Convert will not do this. You would have to write a loop over each input pdf.

But you can use mogrify to process a whole folder of PDFs and make one output for each input (and even put them in one different output folder). See http://www.imagemagick.org/Usage/basics/#mogrify
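
For the loop route, here is a minimal Python sketch that runs convert once per input PDF and writes new_<name> into a separate folder. The folder names, the "new_" prefix and the density of 200 are placeholders, not values confirmed anywhere in this thread; adjust them to suit.

```python
import subprocess
from pathlib import Path


def build_command(src, out_dir, density=200, prefix="new_"):
    """Return the convert command line for one PDF, without running it."""
    dst = Path(out_dir) / (prefix + Path(src).name)
    # -density goes before the input so it applies when the PDF is rasterised.
    return ["convert", "-density", str(density), str(src), str(dst)]


def convert_folder(in_dir, out_dir, density=200):
    """Run convert once per PDF in in_dir, writing new_<name> into out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).iterdir()):
        if src.suffix.lower() == ".pdf":
            # On Windows, check that "convert" resolves to ImageMagick and
            # not the system tool of the same name.
            subprocess.run(build_command(src, out_dir, density), check=True)
```

Calling convert_folder("scans", "fixed") would then turn scans\ABC.pdf into fixed\new_ABC.pdf, one PDF at a time, stopping at the first file that convert fails on.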