PDF is Lying to me!
Posted: 2014-07-22T13:40:21-07:00
Hi,
I think this is really a question about PDFs. These have been produced in large numbers by a scanner in Singapore; I am in London. The odd thing is that the quality as seen in a PDF viewer is very high: I can see fingerprints on till receipts. But when I convert them to BMP using "convert", the quality is very low and 12pt text is unreadable.
I assumed that convert would follow the quality of the source image by default. If I give it a "-density" option all is good. So could the PDF contain header information with the wrong density?
The interesting thing is that we are using some software which identifies barcodes in the scans and converts the file to a new PDF with the barcode number as a name. Unfortunately, as it does so it loses all quality and, not surprisingly, fails to find the barcode. I think this software is being fooled as well.
I need to fix the PDFs so that the barcode reader can process the images.
Here is a five-page example of around 400 KB.
C:\>identify HiQ.PDF
HiQ.PDF[0] PDF 595x842 595x842+0+0 16-bit sRGB 1.503MB 0.047u 0:00.045
HiQ.PDF[1] PDF 595x842 595x842+0+0 16-bit sRGB 1.503MB 0.031u 0:00.031
HiQ.PDF[2] PDF 595x842 595x842+0+0 16-bit sRGB 1.503MB 0.031u 0:00.031
HiQ.PDF[3] PDF 595x842 595x842+0+0 16-bit sRGB 1.503MB 0.016u 0:00.031
HiQ.PDF[4] PDF 595x842 595x842+0+0 16-bit sRGB 1.503MB 0.016u 0:00.031
HiQ.PDF[5] PDF 595x842 595x842+0+0 16-bit sRGB 1.503MB 0.000u 0:00.016
This is an A4 PDF with five pages. It holds far more detail than 595x842 pixels; viewed on screen it could pass for a photo.
If I convert it to bitmaps without specifying a density I get:
C:\>convert HiQ.PDF t.bmp
C:\>identify t-*.bmp
t-0.bmp BMP 595x842 595x842+0+0 8-bit sRGB 1.506MB 0.016u 0:00.016
t-1.bmp[1] BMP 595x842 595x842+0+0 8-bit sRGB 1.506MB 0.016u 0:00.015
t-2.bmp[2] BMP 595x842 595x842+0+0 8-bit sRGB 1.506MB 0.016u 0:00.016
t-3.bmp[3] BMP 595x842 595x842+0+0 8-bit sRGB 1.506MB 0.016u 0:00.016
t-4.bmp[4] BMP 595x842 595x842+0+0 8-bit sRGB 1.506MB 0.016u 0:00.015
t-5.bmp[5] BMP 595x842 595x842+0+0 8-bit sRGB 1.506MB 0.016u 0:00.016
Now I know this has been converted to 8-bit, so it should be slightly lower quality. But it's actually almost unreadable.
If I convert it specifying -density 400 I get lovely (huge) images.
C:\>identify t.tif
t.tif[0] TIFF 3308x4680 3308x4680+0+0 1-bit Bilevel Gray 102.6MB 0.016u 0:00.047
t.tif[1] TIFF 3308x4680 3308x4680+0+0 1-bit Bilevel Gray 102.6MB 0.047u 0:00.047
t.tif[2] TIFF 3308x4680 3308x4680+0+0 16-bit sRGB 102.6MB 0.031u 0:00.031
t.tif[3] TIFF 3308x4680 3308x4680+0+0 1-bit Bilevel Gray 102.6MB 0.031u 0:00.046
t.tif[4] TIFF 3308x4680 3308x4680+0+0 1-bit Bilevel Gray 102.6MB 0.031u 0:00.046
t.tif[5] TIFF 3308x4680 3308x4680+0+0 1-bit Bilevel Gray 102.6MB 0.031u 0:00.046
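These numbers look consistent with identify reporting the A4 page size in PDF points (1/72 inch) rendered at ImageMagick's default density of 72 dpi, rather than the resolution of the scan stored inside. A rough sketch of that arithmetic, assuming the page is exactly 595x842 points (`pixels_at_density` is just a name made up for illustration):

```python
POINTS_PER_INCH = 72  # PDF page sizes are expressed in points, 72 per inch


def pixels_at_density(points, density):
    """Pixel length of a page side of `points` PDF points rendered at `density` dpi."""
    return round(points * density / POINTS_PER_INCH)


for density in (72, 400):
    width = pixels_at_density(595, density)
    height = pixels_at_density(842, density)
    print(f"{density} dpi -> {width}x{height}")
```

At 72 dpi this gives 595x842, and at 400 dpi it gives 3306x4678, within a couple of pixels of the 3308x4680 above. The small difference presumably means the real page is not exactly 595x842 points.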
Unfortunately I have many thousands of these images. (OK, I've been putting off the problem...) Are there any shortcuts to correct the PDF headers?
I can fix them with
convert -density 400 HiQ.pdf fixed.pdf
But it will take a while. I just tried 6 PDFs, each around 500 KB. It took 20 minutes on a 3.4 GHz Xeon with 16 GB of RAM!
Any pointers on optimising the command line, globbing file names and the like? Ideally convert should process each PDF and create a fixed_?.PDF.
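To make the batch idea concrete, here is a minimal sketch of a wrapper that walks a folder and runs the same convert command on each PDF, writing fixed_<name>.pdf next to it. It is only an illustration under assumptions: the helper names (`fixed_name`, `build_command`, `fix_folder`) are invented, the density of 200 is a guess based on the 1654x2339 TIFs from the same scanner (400 is the value actually tested above), and it assumes ImageMagick's convert is the one found first on PATH, since Windows ships an unrelated convert.exe.

```python
import subprocess
from pathlib import Path

DENSITY = 200  # assumption: a guess from the TIF dimensions; 400 is the tested value


def fixed_name(src):
    """Return the output path: fixed_<original stem>.pdf in the same folder."""
    src = Path(src)
    return src.with_name(f"fixed_{src.stem}.pdf")


def build_command(src, density=DENSITY):
    """Build the argument list for: convert -density N in.pdf fixed_in.pdf"""
    return ["convert", "-density", str(density), str(src), str(fixed_name(src))]


def fix_folder(folder):
    """Run convert once per PDF in `folder`, skipping files already fixed."""
    for src in sorted(Path(folder).iterdir()):
        if src.suffix.lower() != ".pdf" or src.name.lower().startswith("fixed_"):
            continue
        subprocess.run(build_command(src), check=True)


# Dry run: show the command that would be issued for one file.
print(" ".join(build_command("HiQ.PDF")))
# To process a whole folder, call for example: fix_folder(".")
```

This only automates the loop; it does not make each conversion faster. Any speed-up would have to come from choosing a lower density than 400.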
I also have some monochrome TIFs from the same scanner. These show:
C:\>identify SIN.TIF
SIN.TIF TIFF 1654x2339 1654x2339+0+0 1-bit Bilevel Gray 43.9KB 0.016u 0:00.014
Which is what I suspect the PDFs actually contain!
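If the PDFs really do hold the same raster as the TIFs, the scan resolution can be estimated by dividing the pixel count by the page size in inches. A quick sketch, assuming the page is A4 at 595x842 points (`implied_dpi` is a made-up helper name):

```python
POINTS_PER_INCH = 72  # PDF points per inch


def implied_dpi(pixels, page_points):
    """Scan resolution implied by `pixels` spread across `page_points` PDF points."""
    return pixels * POINTS_PER_INCH / page_points


print(round(implied_dpi(1654, 595)))  # horizontal
print(round(implied_dpi(2339, 842)))  # vertical
```

Both come out at about 200 dpi. If that assumption holds, -density 200 should capture everything in the scan while producing a quarter of the pixels of -density 400.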
Thanks,
Rob.