
Multicropping Dictionary Entries (Based on Whitespace)

Posted: 2012-08-14T23:55:23-07:00
by rramphal
Hi Everyone,

This forum is amazing and I am hoping that someone will be able to help me out with this project:

I have a set of scans of a dictionary. Here are four examples (I left them as links because they are all quite large):
http://img189.imageshack.us/img189/9287/00011y.png
http://img854.imageshack.us/img854/29/03022.png
http://img826.imageshack.us/img826/2843/07452.png
http://img443.imageshack.us/img443/7585/09922.png

I want to run a script on all the files that would multicrop them into their individual entries. Ideally, each image would be split along the red lines shown below and the pieces named sequentially (e.g. 0001.1-01.png, 0001.1-02.png, ..., 0001.1-13.png), so that in the end I would have a set of images, each containing a single dictionary entry.

Image

I had the idea that I could replace rows of consecutive white pixels with one line and then crop from there (like viewtopic.php?f=1&t=20766). I also found this topic: viewtopic.php?f=1&t=16041 which is similar; however, the particular issue with this project is that there is also whitespace between lines within an entry. It seems as though the difference in the whitespace heights is enough to separate just the entries and not the lines, but I'm not sure how to start. I hope that this is clear enough. I would appreciate any ideas or suggestions!

Thanks!
Ravi
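
The whitespace-height idea in the post above can be sketched roughly as follows. This is only an illustration in plain Python, not something posted in the thread: it assumes the scan has already been loaded as a list of rows of 0-255 grayscale values, and the function names, the `threshold` of 128, and the `min_gap` parameter are all hypothetical choices that would need tuning against the real scans.

```python
def row_has_ink(pixels, threshold=128):
    """For each row, True if any pixel is darker than threshold (assumed cutoff)."""
    return [any(v < threshold for v in row) for row in pixels]


def blank_runs(rows_dark):
    """Return (start, length) for each maximal run of blank rows."""
    runs, start = [], None
    for y, dark in enumerate(rows_dark):
        if not dark and start is None:
            start = y                      # a blank run begins here
        elif dark and start is not None:
            runs.append((start, y - start))
            start = None
    if start is not None:                  # run reaching the bottom edge
        runs.append((start, len(rows_dark) - start))
    return runs


def entry_cuts(rows_dark, min_gap):
    """Midpoints of interior blank runs that are at least min_gap rows tall.

    Gaps shorter than min_gap are treated as ordinary line spacing inside
    an entry; the top and bottom page margins are ignored.
    """
    cuts = []
    for start, length in blank_runs(rows_dark):
        interior = start > 0 and start + length < len(rows_dark)
        if interior and length >= min_gap:
            cuts.append(start + length // 2)
    return cuts
```

Whether a single `min_gap` value separates entry gaps from line gaps on every page is exactly the open question in the post, so it may help to look at the lengths returned by `blank_runs` for a few pages before settling on a value.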

Re: Multicropping Dictionary Entries (Based on Whitespace)

Posted: 2012-08-15T10:42:41-07:00
by fmw42
Are you able to insert the red lines yourself? If so, the first reference you gave will provide the solution. If not, note that each entry starts with some bold characters that begin at the left edge of the image, and none of the text that follows within the entry extends into that region. So you should be able to use the first few columns to locate those bold characters. Crop the first few columns, use -scale to reduce them to a single column, and look for the beginning of each dark region; those mark the bold characters. Then move each cut up by half the distance between the bold characters and the bottom of the line just above them. That gives you the Y coordinates for the crops; each crop spans the full width of the image. Loop over the Y coordinates extracted from the dark areas of the column and crop accordingly.
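
The steps in the reply above could look something like this in plain Python. It is a sketch under stated assumptions rather than the ImageMagick commands the reply has in mind: the page is assumed to be a list of rows of 0-255 grayscale values, and `margin_cols`, `threshold`, and every function name here are hypothetical. Instead of scaling the margin strip down to one column, it simply checks each row of the strip for dark pixels, which serves the same purpose.

```python
def margin_profile(pixels, margin_cols, threshold=128):
    """True for each row that has ink within the leftmost margin_cols columns."""
    return [any(v < threshold for v in row[:margin_cols]) for row in pixels]


def headword_tops(margin_dark):
    """Rows where a dark region begins in the margin strip."""
    return [y for y, dark in enumerate(margin_dark)
            if dark and (y == 0 or not margin_dark[y - 1])]


def crop_boxes(pixels, margin_cols, threshold=128):
    """Full-width (left, upper, right, lower) boxes, one per entry."""
    height, width = len(pixels), len(pixels[0])
    any_ink = [any(v < threshold for v in row) for row in pixels]
    cuts = []
    for top in headword_tops(margin_profile(pixels, margin_cols, threshold)):
        y = top
        # Walk up over blank rows to just below the line above the headword.
        while y > 0 and not any_ink[y - 1]:
            y -= 1
        if y == 0:
            continue                       # nothing above this headword
        cuts.append((y + top) // 2)        # halfway across the gap
    edges = [0] + cuts + [height]
    return [(0, y0, width, y1) for y0, y1 in zip(edges, edges[1:])]


def entry_name(stem, index):
    """Sequential name in the style asked for, e.g. 0001.1-01.png."""
    return f"{stem}-{index:02d}.png"
```

The box tuples use the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, should that library be used for the actual cropping. One caveat: a diacritic or a dot sitting above a bold character can show up as its own dark region in the strip, so very short regions may need to be merged with the one below them.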

Re: Multicropping Dictionary Entries (Based on Whitespace)

Posted: 2012-08-16T15:02:24-07:00
by rramphal
Wow ― that is so ingenious! Thank you so much Fred! I'll try it out and see if I can get it to work.