Crop various scanned newspaper-pages

Posted: 2011-07-21T00:47:53-07:00
by steinsvik
Good day!
I am working on digitizing a newspaper archive. Some guy has scanned thousands of pages from microfilm, and I'm taking on the job of turning the scans into something useful.
The pages are readable, but many of the images have big black borders surrounding the pages. Normally the borders cover the right side and/or the bottom of the images. I'm wondering if there is any way to crop these black borders, so that only the white newspaper pages are left in the images. The borders are inconsistent and vary from image to image. I need a command that automatically recognizes the black borders, if any, and crops them away. We are talking about ~16,000 pages, so I need to make a batch script that goes through all the images automatically. That part I can handle myself, if I know the appropriate command for the 'convert' binary.

Here's an example of a scanned page:
[Image: example scanned page]

Re: Crop various scanned newspaper-pages

Posted: 2011-07-21T09:33:15-07:00
by fmw42
If the black goes to the edges, use
convert image -fuzz XX% -trim +repage result

Adjust the XX% to leave the least amount of black aliasing along the border without "eating" away any of the text in your image. Hopefully that will not be a problem so long as there is some white between the black and the text.

With convert you could process multiple images in one command, but IM will need to hold them all in memory.

Or, if you think the fuzz XX% is the same for all of them, you can put all the images in a folder and use mogrify.

With mogrify, IM processes them one at a time, so I believe there is no memory limitation, as it reads each image from the directory as needed. It is best to create a new directory to hold the processed images so you don't overwrite your old ones with mistakes.
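
As a rough sketch of the mogrify route (the function name trim_all, the directory names and the 20% fuzz are placeholders of mine, not values tested on these scans), the -path option makes mogrify write its results into a separate directory instead of overwriting the originals:

```shell
# Trim every .jpg in a source directory and write the results to a
# separate output directory, so the original scans are left untouched.
# Usage: trim_all SRC_DIR OUT_DIR FUZZ_PERCENT
trim_all() {
    src_dir=$1
    out_dir=$2
    fuzz=$3
    mkdir -p "$out_dir" || return 1
    # -path redirects mogrify's output files into out_dir.
    mogrify -path "$out_dir" -fuzz "${fuzz}%" -trim +repage "$src_dir"/*.jpg
}

# Example call (hypothetical directory names):
# trim_all scans trimmed 20
```

Try it on a handful of pages first and compare the results before running it over the whole archive.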

see
http://www.imagemagick.org/Usage/crop/#trim
http://www.imagemagick.org/Usage/basics/#mogrify
http://www.imagemagick.org/Usage/basics ... fy_convert
http://www.imagemagick.org/Usage/basics/#mogrify_not

Re: Crop various scanned newspaper-pages

Posted: 2011-07-21T17:28:40-07:00
by anthony
fmw42 wrote: If the black goes to the edges, use
convert image -fuzz XX% -trim +repage result
To make sure you only trim black, add a black border to the image first!
As I point out in IM Examples, in the section on trimming a specific color:
http://www.imagemagick.org/Usage/crop/#trim_color
The operator is a little dumb and could trim not just the black edges but also some white edges, depending on the order it actually uses.

Also, by giving it a border of a specific color you are specifying the exact center of the fuzz factor color selection. Without it, the fuzz could be centered on a 'scan noise' pixel, and thus select more colors than intended.

Adding a border first to remove a known color is thus recommended.
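
A minimal sketch of that advice, wrapped in a function (the name trim_black and its argument order are made up for illustration; the fuzz value still has to be tuned per scan):

```shell
# Pad the image with a 1-pixel pure black border, then trim.
# The border gives -trim an exact black reference colour for the fuzz match.
# Usage: trim_black INPUT OUTPUT FUZZ_PERCENT
trim_black() {
    convert "$1" -bordercolor black -border 1x1 -fuzz "${3}%" -trim +repage "$2"
}

# Example call (hypothetical file names):
# trim_black page.jpg page_trimmed.jpg 30
```

The added border is trimmed away again along with the real black edge, so it does not end up in the result.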

Re: Crop various scanned newspaper-pages

Posted: 2011-10-22T06:05:39-07:00
by tom43
Hi
I have the same problem.
This command trims only the right black border, not the left one:

convert input.jpg -fuzz 70% -trim +repage output.jpg
How can I remove the left black border?
I can't find a way to do it.
Thanks.
Tom

Re: Crop various scanned newspaper-pages

Posted: 2011-10-22T06:44:35-07:00
by tom43
Hi
I tried Trim, the 'Auto-Crop' operator: http://www.imagemagick.org/Usage/crop/#trim

convert input.jpg -trim +repage trim_repage.jpg
But it is not a smart command. For instance, it only works if the image has perfect black borders on both sides.
My page is not a perfectly scanned image: it has some black at the top and is not quite straight, so this command fails and does nothing.
When I edited the image in Photoshop, straightened it manually, removed the top black border and redrew the black border to make it similar to the example at the link, the command worked.
Some thin black borders are still not removed, so it works, but not as well as in the example. Good autocropping is needed because it is essential for users.
I expected better autocropping from ImageMagick. Any ideas for resolving this problem?
Thanks.
Tom

Re: Crop various scanned newspaper-pages

Posted: 2011-10-22T10:03:06-07:00
by fmw42
Add -fuzz XX% to your command when the border is not a perfectly solid color and varies somewhat.

see
http://www.imagemagick.org/Usage/crop/#trim
http://www.imagemagick.org/Usage/crop/#trim_fuzz

Re: Crop various scanned newspaper-pages

Posted: 2011-10-22T11:09:58-07:00
by tom43
Thank you, fmw42, for your reply, but if you look two posts up you can see my example using fuzz. In my opinion, after testing, ImageMagick is powerful for some tasks but is not the right tool for processing scanned pages such as newspapers or books. Cropping is very poor, and so are shave and trim. For instance, take a book, scan two pages at the same time, then try to crop the black zone, and you'll see. I'm disappointed with ImageMagick. These don't work either (NN = a value):

convert xxx.tif -gravity center -crop NNxNN+0+0 xxx.jpg
convert temple.tif -shave NNxNN shaved.jpg
convert sample.png -crop NNxNN+NN+NN +repage cropped.png
Setting coordinates doesn't work properly.
The cropping, trimming and shaving commands in ImageMagick are very weak and not easy to use.
How can I do with ImageMagick what the tool named unpaper does?
http://unpaper.berlios.de
I need this task done and can't find how to do it with ImageMagick.
Sincerely yours,
Tom

Re: Crop various scanned newspaper-pages

Posted: 2011-10-22T18:15:28-07:00
by fmw42
Post a link to an example of your images.

Re: Crop various scanned newspaper-pages

Posted: 2011-10-22T22:39:10-07:00
by anthony
Trim is very sensitive. It only takes one 'noisy' pixel to make it fail.

One way to fix this is to blur the image before trimming, then look up the results to apply to the unblurred image. See Examples at
IM Examples, Trimming 'Noisy' Images -- Scanned or Video Images
http://www.imagemagick.org/Usage/crop/#trim_blur
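
A sketch of that blur-then-look-up idea, along the lines of the linked IM Examples section (the function name trim_noisy is made up, and the blur sigma of 15 and fuzz of 15% are only starting values that will need tuning for microfilm scans):

```shell
# Measure the trim box on a blurred copy, then crop the unblurred original.
# Usage: trim_noisy INPUT OUTPUT
trim_noisy() {
    # Only the geometry (WxH+X+Y) is printed; no blurred image is saved.
    geometry=$(convert "$1" -virtual-pixel edge -blur 0x15 -fuzz 15% \
        -trim -format '%wx%h%O' info:) || return 1
    # Apply the measured box to the untouched original.
    convert "$1" -crop "$geometry" +repage "$2"
}

# Example call (hypothetical file names):
# trim_noisy page.jpg page_trimmed.jpg
```

The blur smears isolated noise pixels into their surroundings, so a single speck no longer stops the trim, while the crop itself is done on the sharp image.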


Another way is to try and determine the bounds of the actual page (assuming it isn't rotated or skewed).

For example taking the first image (the full-sized one found on the page pointed to by the thumbnail),
you can compress the image vertically down to a single row of average pixels.

convert 931877.jpeg -scale 1600x1! row.png

Now if you do a profile of that image you will see that there is a distinct page bounds being generated.
im_profile is a script that uses gnuplot to plot the row of pixels; I am only using it to see how the intensity behaves across the image.
http://www.imagemagick.org/Usage/scripts/im_profile

im_profile row.png row_profile.png
[Image: profile plot of row.png]
Note the sudden change from bright to mostly dark in the profile. That is the edge. It is the location of this edge that gives you your crop bounds left to right across the image. Repeat this vertically and you get your page bounds regardless of how 'dirty' your border is. It is a much more accurate trim technique for aligned rectangles with extreme noise effects.

Re: Crop various scanned newspaper-pages

Posted: 2011-10-22T22:45:08-07:00
by anthony
Additionally, a morphology operator such as dilate, or better still close, can be used to effectively remove the text and thin lines from the image, making it easier to determine the page bounds.

Basic Morphology Operators
http://www.imagemagick.org/Usage/morphology/#basic
This basically provides a much sharper boundary for page location determination than using blur would.

Use it with the Noisy Trim technique
http://www.imagemagick.org/Usage/crop/#trim_blur
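
As an untested sketch of combining the two (the function name page_box_closed, the Disk kernel, its default radius of 5 and the default 15% fuzz are all guesses for illustration), the close step takes the place of the blur when measuring the trim box:

```shell
# Print the trim box (WxH+X+Y) measured on a morphologically closed copy.
# Close (dilate, then erode) fills in small dark detail such as text on the
# bright page, so the trim sees a cleaner page/border edge.
# Usage: page_box_closed INPUT [KERNEL_RADIUS] [FUZZ_PERCENT]
page_box_closed() {
    convert "$1" -morphology Close "Disk:${2:-5}" -fuzz "${3:-15}%" \
        -trim -format '%wx%h%O' info:
}

# Example: crop the original with the measured box (hypothetical file names):
# convert page.jpg -crop "$(page_box_closed page.jpg)" +repage page_trimmed.jpg
```

The kernel radius needs to be larger than the stroke width of the text for the text to disappear, so it depends on the scan resolution.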

Re: Crop various scanned newspaper-pages

Posted: 2012-01-09T03:34:45-07:00
by steinsvik
Thank you for the replies.

I'm still having problems grasping how to perform this task. Profiling the image seems to be the right track, but I don't understand how to do it on the command line. I've tried combinations of most of the trim, crop and fuzz arguments, but with no luck. I'm sorry if I'm a bit dim, but this is a bit over my head. :oops:
If I find a way to run this operation on the command line (I'm using Linux), I will be able to make a PHP script that churns through all 10,000+ pages and crops the black borders automatically.

This is the archive project I'm working on: http://46.137.172.165/
Click on a year to see the thumbnails, then click the thumbnails to download the full JPEG, or download the entire edition as PDF.
As you can see, the black borders are pretty nasty on some of the pages.. :(

Please help! :shock:

Re: Crop various scanned newspaper-pages

Posted: 2012-01-09T15:32:14-07:00
by fmw42
Here is a unix solution that follows Anthony's idea of averaging down to one row and one column and trimming the row and column with a fuzz factor. It then obtains the width and x offset, and the height and y offset, from the virtual canvas (page) geometry, and uses those values to crop the image. However, the fuzz factors are most likely going to be image dependent, depending on how black and how uniform the (sort of) black borders really are.

Also, as there seems to be a bug in trimming 1-column images, I have had to rotate the image 90 degrees.

Using your image from your first post at the top:


infile="931877.jpeg"
inname=$(convert "$infile" -format "%t" info:)
# Average the image down to one row, pad it with black, and trim to find the page width and x offset.
convert "$infile" +repage -scale x1! -bordercolor black -border 1 -fuzz 30% -trim "${inname}_tmp1.png"
width=$(convert "${inname}_tmp1.png" -format "%w" info:)
offsets=$(convert "${inname}_tmp1.png" -format "%O" info:)
xoff=$(echo "$offsets" | cut -d+ -f2)
# Rotate by -90 degrees so the page height and y offset can be measured as a row too.
convert "$infile" +repage -rotate -90 -scale x1! -bordercolor black -border 1 -fuzz 60% -trim "${inname}_tmp2.png"
height=$(convert "${inname}_tmp2.png" -format "%w" info:)
offsets=$(convert "${inname}_tmp2.png" -format "%O" info:)
yoff=$(echo "$offsets" | cut -d+ -f2)
# Crop the original image to the measured page bounds.
convert "$infile" -crop "${width}x${height}+${xoff}+${yoff}" +repage "${inname}_crop.jpg"


If you are on Windows, I cannot help except to point you to http://www.imagemagick.org/Usage/windows/ as there are syntax differences in IM and I don't know the batch-file scripting equivalents of the unix commands above.


EDIT: It seems that there is no bug. My display was not showing the long, vertical 1-column image. So the above could be changed to the following to avoid the image rotation:



infile="931877.jpeg"
inname=$(convert "$infile" -format "%t" info:)
# Average the image down to one row, pad it with black, and trim to find the page width and x offset.
convert "$infile" +repage -scale x1! -bordercolor black -border 1 -fuzz 30% -trim "${inname}_tmp1.png"
width=$(convert "${inname}_tmp1.png" -format "%w" info:)
offsets=$(convert "${inname}_tmp1.png" -format "%O" info:)
xoff=$(echo "$offsets" | cut -d+ -f2)
# Do the same with a single column to find the page height and y offset.
convert "$infile" +repage -scale 1x! -bordercolor black -border 1 -fuzz 60% -trim "${inname}_tmp2.png"
height=$(convert "${inname}_tmp2.png" -format "%h" info:)
offsets=$(convert "${inname}_tmp2.png" -format "%O" info:)
yoff=$(echo "$offsets" | cut -d+ -f3)
# Crop the original image to the measured page bounds.
convert "$infile" -crop "${width}x${height}+${xoff}+${yoff}" +repage "${inname}_crop.jpg"

You will need to remove the two tmp files afterwards.

EDIT2: I would suggest you test a few images and set the fuzz value in both parts to as high a value as it can stand without cropping into the text part of your image. (In the above, I used the minimum fuzz values that would just make it work.)
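
For the batch run over a whole directory, the steps above could be wrapped roughly like this. It is only a sketch under a few assumptions of mine: the function names crop_page and crop_all, the .jpeg extension and the directory layout are made up, and the measurements are read straight from info: rather than from the two tmp files, which is a small change from the script above.

```shell
# Crop one scan to its page bounds using the one-row / one-column averaging trick.
# Usage: crop_page INPUT OUTPUT [ROW_FUZZ] [COLUMN_FUZZ]
crop_page() {
    src=$1
    dst=$2
    fx=${3:-30}   # fuzz % for the row (left/right bounds)
    fy=${4:-60}   # fuzz % for the column (top/bottom bounds)
    # Row measurement, printed as WIDTH+X+Y
    row=$(convert "$src" +repage -scale 'x1!' -bordercolor black -border 1 \
        -fuzz "${fx}%" -trim -format '%w%O' info:) || return 1
    # Column measurement, printed as HEIGHT+X+Y
    col=$(convert "$src" +repage -scale '1x!' -bordercolor black -border 1 \
        -fuzz "${fy}%" -trim -format '%h%O' info:) || return 1
    width=$(echo "$row" | cut -d+ -f1)
    xoff=$(echo "$row" | cut -d+ -f2)
    height=$(echo "$col" | cut -d+ -f1)
    yoff=$(echo "$col" | cut -d+ -f3)
    # Refuse to crop if either measurement came back empty.
    [ -n "$width" ] && [ -n "$height" ] || return 1
    convert "$src" -crop "${width}x${height}+${xoff}+${yoff}" +repage "$dst"
}

# Run crop_page over every .jpeg in SRC_DIR, writing NAME_crop.jpg into OUT_DIR.
# Usage: crop_all SRC_DIR OUT_DIR
crop_all() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for f in "$src_dir"/*.jpeg; do
        [ -e "$f" ] || continue   # skip the unexpanded pattern if nothing matches
        crop_page "$f" "$out_dir/$(basename "$f" .jpeg)_crop.jpg" \
            || echo "failed: $f" >&2
    done
}

# Example call (hypothetical directory names):
# crop_all scans cropped
```

The default fuzz values are just the ones used for the example image above, so pages with lighter or patchier borders will probably need different ones.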