Crop various scanned newspaper-pages

steinsvik

Crop various scanned newspaper-pages

Post by steinsvik »

Good day!
I am working on digitizing a newspaper archive. Some guy has scanned thousands of pages from microfilm, and I'm taking the job of turning the scans into something useful.
The pages are readable, but many of the images have big black borders surrounding the pages. Normally the borders cover the right side and/or the bottom of the images. I'm wondering if there is any way to crop these black borders, so that only the white newspaper pages are left in the images. The borders are inconsistent and vary from image to image. I need a command that automatically recognizes the black borders, if any, and crops them away. We are talking about ~16,000 pages, so I need to make a batch script that goes through all the images automatically. That part I can handle myself, if I know the appropriate command for the 'convert' binary.

Here's an example of a scanned page:
Image
fmw42

Re: Crop various scanned newspaper-pages

Post by fmw42 »

If the black goes to the edges, use

convert image -fuzz XX% -trim +repage result

Adjust the XX% to leave the least amount of black aliasing along the border without "eating" away any of the text in the image. Hopefully that will not be a problem so long as there is some white between the black and the text.

With convert you could do multiple images, but IM will need to hold them all in memory.

Or, if you think the fuzz XX% is the same for all of them, you can put all the images in a folder and use mogrify.

With mogrify, IM will process them one at a time, so I believe there is no memory limitation, as it finds each image as needed in the directory. It is best to create a new directory to hold the processed images so you don't overwrite your old ones with mistakes.
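
A rough sketch of the mogrify approach, assuming the scans are JPEG files in a single folder and that one fuzz value suits all of them. The function name trim_folder, the folder names and the 30% figure are placeholders for illustration, not values from this thread.

```shell
# trim_folder SRC_DIR OUT_DIR FUZZ_PERCENT
# Hypothetical helper: trims every .jpg in SRC_DIR and writes the results to
# OUT_DIR, so the original scans are never overwritten.
trim_folder() {
    src_dir=$1
    out_dir=$2
    fuzz=$3
    mkdir -p "$out_dir" || return 1
    # -path makes mogrify write into another directory instead of modifying
    # the input files in place.
    mogrify -path "$out_dir" -fuzz "${fuzz}%" -trim +repage "$src_dir"/*.jpg
}

# Example call (assumed folder names):
#   trim_folder scans trimmed 30
```

Starting with a handful of pages in a test folder makes it easier to judge the fuzz value before running all the scans.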

see
http://www.imagemagick.org/Usage/crop/#trim
http://www.imagemagick.org/Usage/basics/#mogrify
http://www.imagemagick.org/Usage/basics ... fy_convert
http://www.imagemagick.org/Usage/basics/#mogrify_not
anthony

Re: Crop various scanned newspaper-pages

Post by anthony »

fmw42 wrote:If the black goes to the edges, use
convert image -fuzz XX% -trim +repage result
To make sure you only trim black, add a black border to the image first!
As I point out in IM Examples on trimming a specific color
http://www.imagemagick.org/Usage/crop/#trim_color
the operator is a little dumb and could trim not just the black edges but also some white edges, depending on the order it actually uses.

Also, by giving it a specific colored border you are specifying the exact center of the fuzz factor color selection. Without it the fuzz could select a 'scan noise' pixel, and thus select more colors than intended.

Adding a border first to remove a known color is thus recommended.
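
As a minimal sketch of that advice (the function name trim_black and the file names are made up for illustration), the border is added and the trim applied in one command:

```shell
# trim_black IN OUT FUZZ_PERCENT
# Hypothetical helper: pads the image with a 1-pixel black border so that
# -trim keys on black, then trims and resets the virtual canvas.
trim_black() {
    convert "$1" -bordercolor black -border 1x1 \
        -fuzz "${3}%" -trim +repage "$2"
}

# Example call (assumed file names and fuzz value):
#   trim_black page.jpg page_trimmed.jpg 25
```

On a side where the page itself reaches the edge of the scan, only the added 1-pixel border should be removed again, so the white is left alone.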
Anthony Thyssen -- Webmaster for ImageMagick Example Pages
https://imagemagick.org/Usage/
tom43

Re: Crop various scanned newspaper-pages

Post by tom43 »

Hi
I have the same problem.
This command trims only the right black border and not the left:

Code:

convert input.jpg -fuzz 70% -trim +repage output.jpg
How can I remove the left black border?
I can't find a way to do it.
Thanks.
Tom
tom43

Re: Crop various scanned newspaper-pages

Post by tom43 »

Hi
I tried Trim, the 'Auto-Crop' operator - http://www.imagemagick.org/Usage/crop/#trim

Code:

convert input.jpg -trim +repage trim_repage.jpg
But it is not a smart command. For instance, it only works if the image has perfect black borders on both sides.
My page is not a perfectly scanned image: it has some black at the top and is not quite straight, so this command fails and does nothing.
I edited the image in Photoshop, straightened it manually, removed the top black border and redrew the black border to make it similar to the example at the link, and then voilà, the command works.
Some thin black borders are not removed... it works, but not like the example. Good autocropping is needed because it is essential for users.
Hmm... I expected better autocropping from ImageMagick... any idea how to resolve this problem?
Thanks.
Tom
fmw42

Re: Crop various scanned newspaper-pages

Post by fmw42 »

Add -fuzz XX% to your command when the border is not a perfectly solid color and varies somewhat.

see
http://www.imagemagick.org/Usage/crop/#trim
http://www.imagemagick.org/Usage/crop/#trim_fuzz
tom43

Re: Crop various scanned newspaper-pages

Post by tom43 »

Thank you, fmw42, for your reply, but if you look two posts up you can see my example using fuzz. In my opinion, after testing, ImageMagick is powerful for some tasks but is not the right tool for processing scanned pages such as newspapers or books. Cropping is very poor, and so are shave and trim. For instance, take a book, scan two pages at the same time, then try to crop the black zone and you'll see. I'm disappointed with ImageMagick. These don't work either (NN=value):

Code:

convert xxx.tif -gravity center -crop NNxNN+0+0 xxx.jpg
convert temple.tif -shave NNxNN shaved.jpg
convert sample.png -crop NNxNN+NN+NN +repage cropped.png
Setting coordinates doesn't work properly.
The cropping, trimming and shaving commands in ImageMagick are very weak and not easy to use.
How can I do with ImageMagick what the tool named UNPAPER does?
http://unpaper.berlios.de
I need this task and can't find how to do it with ImageMagick.
Sincerely yours,
Tom
fmw42

Re: Crop various scanned newspaper-pages

Post by fmw42 »

Post a link to an example of your images.
anthony

Re: Crop various scanned newspaper-pages

Post by anthony »

Trim is very sensitive. It only takes one 'noisy' pixel to make it fail.

One way to fix this is to blur the image before trimming, then look up the results to apply to the unblurred image. See Examples at
IM Examples, Trimming 'Noisy' Images -- Scanned or Video Images
http://www.imagemagick.org/Usage/crop/#trim_blur
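
A sketch of the blur technique linked above, under assumptions: the crop box is measured on a blurred copy and then applied to the untouched original. The function name blur_trim and the defaults (blur sigma 15, fuzz 15%) are starting points for illustration only and would need tuning on real scans.

```shell
# blur_trim SRC DST [BLUR_SIGMA] [FUZZ_PERCENT]
# Hypothetical helper: finds the trim box on a blurred copy, then crops the
# original image with that box.
blur_trim() {
    src=$1
    dst=$2
    sigma=${3:-15}
    fuzz=${4:-15}
    # Blurring smears isolated noisy pixels away, so the trim box follows
    # the large dark areas only. info: prints the geometry as WxH+X+Y and
    # writes no image.
    box=$(convert "$src" -virtual-pixel edge -blur "0x${sigma}" \
            -fuzz "${fuzz}%" -trim -format '%wx%h%O' info:) || return 1
    # Apply the measured box to the original, unblurred image.
    convert "$src" -crop "$box" +repage "$dst"
}

# Example call (assumed file names):
#   blur_trim page.jpg page_crop.jpg
```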


Another way is to try and determine the bounds of the actual page (assuming it isn't rotated or skewed).

For example taking the first image (the full-sized one found on the page pointed to by the thumbnail),
you can compress the image vertically down to a single row of average pixels.

convert 931877.jpeg -scale 1600x1! row.png

Now if you do a profile of that image you will see that distinct page bounds are being generated.
im_profile is a script that uses gnuplot to plot the row of pixels, but I am only using it to see how the brightness behaves across the image.
http://www.imagemagick.org/Usage/scripts/im_profile

Code:

im_profile  row.png row_profile.png
Image
Note the sudden change from bright to mostly dark in the profile. That is the edge. It is the location of this edge that will give you your crop bounds left to right across the image. Repeat this vertically and you get your page bounds regardless of how 'dirty' your border is. It is a much more accurate trim technique for aligned rectangles with extreme noise effects.
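
One way to read the edge positions off the row without plotting it, sketched under assumptions: threshold the averaged row to pure black and white, dump it in the txt: output format, and take the first and last white pixel as the left and right page edges. The helper names bright_span and page_columns and the 50% threshold are illustrative, and the awk filter assumes that 8-bit txt: output writes white pixels as #FFFFFF.

```shell
# bright_span: reads txt: pixel output of a 1-pixel-high row on stdin and
# prints "WIDTH XOFFSET" of the span from the first to the last white pixel.
bright_span() {
    awk -F'[,:]' '
        /^#/ { next }                  # skip the header comment line
        /#FFFFFF/ {                    # a pixel that thresholded to white
            if (first == "") first = $1
            last = $1
        }
        END {
            if (first == "") exit 1    # no white pixel found at all
            print last - first + 1, first
        }'
}

# page_columns SRC [THRESHOLD_PERCENT]
# Hypothetical helper: averages the image down to one row, thresholds it,
# and reports the width and x offset of the bright span.
page_columns() {
    convert "$1" -colorspace Gray -scale 'x1!' -threshold "${2:-50}%" \
        -depth 8 txt:- | bright_span
}

# Example call (assumed file name):
#   page_columns 931877.jpeg
```

The same idea applied to a one-pixel-wide column (-scale '1x!', reading the y coordinate instead of x) would give the top and bottom edges. A stray bright streak inside the black border would widen the span, so the result is worth checking by eye on a few pages.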
anthony

Re: Crop various scanned newspaper-pages

Post by anthony »

Additionally, a morphology operator such as dilate, or better still close, can be used to effectively remove the text and thin lines from the image, making it easier to determine the page bounds.

Basic Morphology Operators
http://www.imagemagick.org/Usage/morphology/#basic
This basically provides a much sharper boundary for page location determination than using blur would.

Use it with the Noisy Trim technique
http://www.imagemagick.org/Usage/crop/#trim_blur
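
A sketch of how close could be combined with the noisy trim technique, under assumptions: the text is closed away on a working copy, the trim box is measured there, and the original is cropped with it. The function name close_trim, the Disk:5 kernel and the 30% fuzz are illustrative guesses. The format string subtracts 1 from each offset on the assumption that the added 1-pixel border shifts the reported offsets by one pixel.

```shell
# close_trim SRC DST [DISK_RADIUS] [FUZZ_PERCENT]
# Hypothetical helper: measures the page box on a morphologically closed
# copy, then crops the original image with that box.
close_trim() {
    src=$1
    dst=$2
    radius=${3:-5}
    fuzz=${4:-30}
    # Close removes dark details thinner than the kernel (text, rules), so
    # the page becomes a nearly solid bright area. The black border makes
    # -trim key on black. Assumed: the border adds 1 to both offsets, which
    # the fx expressions take off again.
    box=$(convert "$src" +repage -morphology Close "Disk:${radius}" \
            -bordercolor black -border 1x1 -fuzz "${fuzz}%" -trim \
            -format '%wx%h+%[fx:page.x-1]+%[fx:page.y-1]' info:) || return 1
    # Crop the untouched original with the measured box.
    convert "$src" -crop "$box" +repage "$dst"
}

# Example call (assumed file names):
#   close_trim page.jpg page_crop.jpg
```

A larger disk removes heavier print such as headlines but takes longer on full-size scans.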
steinsvik

Re: Crop various scanned newspaper-pages

Post by steinsvik »

Thank you for the replies.

I'm still having problems grasping how to perform this task. Profiling the image seems to be the right track, but I don't understand how to do it on the command line. I've tried combinations of most of the trim, crop and fuzz arguments, but no luck. I'm sorry if I'm a bit dim, but this is a bit over my head. :oops:
If I find a way to run this operation on the command line (I'm using Linux), I will be able to make a PHP script that churns through all 10.000++ pages and crops the black borders automatically.

This is the archive project I'm working on: http://46.137.172.165/
Click on a year to see the thumbnails, then click the thumbnails to download the full JPEG, or download the entire edition as PDF.
As you can see, the black borders are pretty nasty on some of the pages.. :(

Please help! :shock:
fmw42

Re: Crop various scanned newspaper-pages

Post by fmw42 »

Here is a unix solution that follows Anthony's idea of averaging down to one row and one column and trimming the row and column with a fuzz factor. Then obtain the width and xoffset and the height and yoffset from the virtual canvas (page) geometry. Then use those values to crop the image. However, the fuzz factors are most likely going to be image dependent, depending upon the (sort of) black borders, how black they really are and how uniform they really are.

Also, as there seems to be a bug in trimming 1-column images, I have had to rotate the image 90 degrees.

Using your image from your first post at the top:


infile="931877.jpeg"
# base name of the input (no directory, no extension)
inname=`convert "$infile" -format "%t" info:`
# average all rows into one, pad with black, trim: leaves the page's column span
convert "$infile" +repage -scale x1! -bordercolor black -border 1 -fuzz 30% -trim "${inname}_tmp1.png"
width=`convert "${inname}_tmp1.png" -format "%w" info:`
offsets=`convert "${inname}_tmp1.png" -format "%O" info:`
xoff=`echo $offsets | cut -d+ -f2`
# rotate so the columns become a row and repeat: %w is now the page height
convert "$infile" +repage -rotate -90 -scale x1! -bordercolor black -border 1 -fuzz 60% -trim "${inname}_tmp2.png"
height=`convert "${inname}_tmp2.png" -format "%w" info:`
offsets=`convert "${inname}_tmp2.png" -format "%O" info:`
yoff=`echo $offsets | cut -d+ -f2`
# crop the original to the measured box
convert "$infile" -crop ${width}x${height}+${xoff}+${yoff} +repage "${inname}_crop.jpg"


If you are on Windows, I cannot help except to point you to http://www.imagemagick.org/Usage/windows/ as there are syntax differences in IM and I don't know the Batch file scripting equivalents of the unix commands above.


EDIT: It seems that there is no bug. My display was not showing a long vertical 1-column image. So the above could be changed to the following to avoid the image rotation:



infile="931877.jpeg"
# base name of the input (no directory, no extension)
inname=`convert "$infile" -format "%t" info:`
# one averaged row: width and x offset of the page
convert "$infile" +repage -scale x1! -bordercolor black -border 1 -fuzz 30% -trim "${inname}_tmp1.png"
width=`convert "${inname}_tmp1.png" -format "%w" info:`
offsets=`convert "${inname}_tmp1.png" -format "%O" info:`
xoff=`echo $offsets | cut -d+ -f2`
# one averaged column: height and y offset of the page
convert "$infile" +repage -scale 1x! -bordercolor black -border 1 -fuzz 60% -trim "${inname}_tmp2.png"
height=`convert "${inname}_tmp2.png" -format "%h" info:`
offsets=`convert "${inname}_tmp2.png" -format "%O" info:`
yoff=`echo $offsets | cut -d+ -f3`
# crop the original to the measured box
convert "$infile" -crop ${width}x${height}+${xoff}+${yoff} +repage "${inname}_crop.jpg"

You will need to remove the two tmp files afterwards.
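
For a whole folder, the same row and column trim could be wrapped in a function that reads the geometry straight from info: and so writes no tmp files. This is only a sketch: the names crop_page and crop_folder, the .jpeg extension and the default fuzz values (taken from the example above) are assumptions, and the offsets are used exactly as the script above uses them.

```shell
# crop_page SRC DST [ROW_FUZZ] [COL_FUZZ]
# Hypothetical wrapper around the row/column trim, without temporary files.
crop_page() {
    src=$1
    dst=$2
    row_fuzz=${3:-30}
    col_fuzz=${4:-60}

    # Averaged row: prints e.g. "1400 +13+1" (width, then +X+Y offset).
    row=$(convert "$src" +repage -scale 'x1!' -bordercolor black -border 1 \
            -fuzz "${row_fuzz}%" -trim -format '%w %O' info:) || return 1
    # Averaged column: prints e.g. "2100 +1+25" (height, then +X+Y offset).
    col=$(convert "$src" +repage -scale '1x!' -bordercolor black -border 1 \
            -fuzz "${col_fuzz}%" -trim -format '%h %O' info:) || return 1

    width=${row%% *}     # text before the first space
    xoff=${row#* +}      # drop "WIDTH +", leaving "X+Y"
    xoff=${xoff%%+*}     # keep the X part
    height=${col%% *}
    yoff=${col##*+}      # keep the Y part (text after the last "+")

    # As in the script above, the offsets include the added 1-pixel border,
    # so the box may sit one pixel off the true edge.
    convert "$src" -crop "${width}x${height}+${xoff}+${yoff}" +repage "$dst"
}

# crop_folder SRC_DIR OUT_DIR: runs crop_page on every .jpeg in SRC_DIR.
crop_folder() {
    mkdir -p "$2" || return 1
    for f in "$1"/*.jpeg; do
        [ -e "$f" ] || continue
        crop_page "$f" "$2/$(basename "$f")" || echo "failed: $f" >&2
    done
}

# Example call (assumed folder names):
#   crop_folder scans cropped
```

Writing to a separate output folder keeps the original scans intact if the fuzz values turn out to be wrong for some pages.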

EDIT2: I would suggest you test a few images and set the fuzz value in both parts to as high a value as it can stand without cropping into the text part of your image. (In the above, I used the minimum fuzz values that would just make it work.)