Anybody working with scanned documents faces this issue: Scanned PDFs are often low contrast, contain scanning artifacts such as book comb bindings, and are skewed and rotated somehow. This not only makes them difficult to read on e-reader and tablet devices but also reduces the size of usable content on printed A4 pages.
Common problems with scanned documents
I have been printing many classical music pieces from IMSLP.com which offers a collection of sheet music in the public domain. This included graded sight-reading course books with hundreds of pages of PDF scans. These contain the following issues:
As you can see, these problem areas not only waste printer toner, they also reduce the quality of printed material and create eye strain and readability issues. For some time now I have wanted to write a script to correct these problems, as it is fairly straightforward using image manipulation tools like ImageMagick and Bash tools like parallel to efficiently execute the same operation on hundreds of image files. Each problem area above can be solved with the corresponding step listed here:
- Crop unwanted scanning artifacts such as book comb bindings by defining a crop margin using a GUI
- Crop white space borders to increase the size of usable content on tablets/e-readers
- Deskew crooked pages (straightens up the page and reduces unnecessary margins)
- Increase contrast by applying a black threshold to remove grey background
- Convert image to true black and white (to save on printer ink/toner and increase readability)
- Sharpen text
Depending on the quality of original PDF, this allowed me to increase the size of music staves by 10.7% on the final printed page, while also increasing readability due to maximum contrast and saving printer toner by removing unwanted/distracting scanning artifacts.
YouTube Video
The example PDF file in the video is of terrible quality to begin with… and I wouldn’t normally bother with such a file (there are typically many alternative versions of the same piece on IMSLP.com). It’s just a matter of finding a better quality PDF scan and then applying the series of commands to get a final output that is high contrast, cropped, straightened and sharpened.
Main Benefits of this solution
The main differentiator of using this solution:
- Scalability: the process scales well to hundreds of pages
- Reduce waiting/processing time by using available CPU and memory resources optimally (multi threading)
- Completely local tool – no upload to online servers required
- Supports different page layouts (e.g. facing pages as in a booklet, single page layouts, or any custom arrangement; see GUI options in the next section)
- Higher cropping accuracy because of the ability to manually tweak individual pages and apply settings in bulk.
- Free of charge
The Python Tkinter Graphical User Interface
Now stating the obvious: I really did not spend much time on this UI…. as soon as I got it working for the intended use case I stopped enhancing it. If there is enough interest I might clean it up. For now it offers basic but powerful functionality for manipulating PDF scans for any computer running Python.
- Apply crop borders using arrow keys for top, bottom, left and right margins
- Use SHIFT + arrow keys to reduce margins
- Save time by applying the same crop margins to many files in bulk: all pages, even pages, odd pages, remaining pages in range, or uncropped pages
- Save crop settings to CSV file
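To make the bulk options and the CSV export more concrete, here is a minimal Python sketch. It is not the actual show.py code, which is not shown in this post. It assumes margins are stored in pixels, that each CSV row holds a crop geometry, an input file and an output file (the column order is inferred from the convert command later in this post), and the function names and file names are hypothetical.

```python
import csv
import io


def crop_geometry(width, height, left=0, right=0, top=0, bottom=0):
    """Turn four pixel margins into an ImageMagick-style WxH+X+Y string."""
    new_w = width - left - right
    new_h = height - top - bottom
    if new_w <= 0 or new_h <= 0:
        raise ValueError("margins remove the whole page")
    return f"{new_w}x{new_h}+{left}+{top}"


def select_pages(total, mode):
    """Return 1-based page numbers for a bulk mode: 'all', 'even' or 'odd'."""
    pages = range(1, total + 1)
    if mode == "even":
        return [p for p in pages if p % 2 == 0]
    if mode == "odd":
        return [p for p in pages if p % 2 == 1]
    return list(pages)


def write_cropargs(rows, handle):
    """Write (geometry, input_png, output_png) rows as CSV (assumed layout)."""
    csv.writer(handle).writerows(rows)


# Example: crop a comb binding off the left edge of every odd page.
geometry = crop_geometry(4960, 7016, left=300, top=50)
buffer = io.StringIO()
write_cropargs(
    [
        (geometry, f"out/page_{p}_orig.png", f"out/page_{p}_final.png")
        for p in select_pages(4, "odd")
    ],
    buffer,
)
print(buffer.getvalue())
```

Keeping the margins separate from the geometry string makes it easy to apply one set of margins to many pages and only build the final string when saving.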
Now, it might surprise you to hear me (an automation addict) say this, however, for me it is worth spending 15 minutes manually tweaking a file if that means the final printed booklet is easier to read, cheaper to print and good-looking in general. I am planning to spend a lot of time learning, playing and enjoying these music pieces. It really is no comparison to spending that little bit of time beforehand cleaning the documents.
Using Bash Unix tools for this purpose
Part of my interest in this exercise is to play with the parallel command and learn about ImageMagick’s image manipulation options. Here is a table of the commands used and their purpose.
Command | Purpose | Instruction |
split.sh | Bash script to determine the number of pages in your input PDF and split it into smaller chunks (5 pages each), as determined by the BATCH_SIZE variable at the top of the file. This helps us optimise the batch run later, using parallel to process multiple chunks at the same time. | You can edit the BATCH_SIZE variable to adjust based on your system resources. I found 5 pages per batch to be sufficient. |
gs | Used for PDF operations (splitting a file) | |
convert | We use ImageMagick’s convert command to do all of the PDF to PNG conversion and image manipulation (cropping, deskewing, unrotating, thresholding, sharpening, trimming and converting back to PDF). I carefully tested and tuned the input parameters to get the best outcome. | If your PDFs come out too dark, try adjusting the -threshold parameter. I found 80% to be a good value: any pixel darker than 80% brightness is set to pure black, and any pixel brighter than that is set to pure white. This removes the grey background in the scanned image and increases contrast. |
parallel | GNU parallel is a very useful command for parallelising multiple instances of other commands. We use it to run multiple instances of the aforementioned convert command concurrently to make better use of system resources and speed up the conversion process. | The -j parameter determines the maximum number of sub-processes to use. My server has a 6 core/12 thread CPU and I could run 6 or 7 sub-processes depending on the task. To avoid stalling your machine, I suggest starting with a value of 2. |
python show.py | Custom Python GUI which loads all the PNG files in the out directory and provides a graphical interface for setting crop borders. | Use arrow keys to increase crop borders on left, right, top and bottom. To reduce a border, hold down shift and press the corresponding arrow key. You can apply the same settings to multiple pages using the provided buttons. The text fields are not editable. When you are done removing scanning artifacts from images, press the save to file button. This saves all your settings to a file called cropargs.csv |
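To illustrate what the threshold step does to a scanned page, here is a small pure-Python sketch. It only mimics the idea on a list of 8-bit grayscale values and is not how ImageMagick is implemented; the function name and the sample values are made up for illustration.

```python
def threshold(pixels, percent=80, max_value=255):
    """Map grayscale values to pure black (0) or pure white (max_value)."""
    cutoff = max_value * percent / 100
    # Values above the cutoff become white, everything else becomes black.
    return [max_value if p > cutoff else 0 for p in pixels]


# Dark ink (30), mid grey (120, 200) and a light grey background (210, 250).
row = [30, 120, 200, 210, 250]
print(threshold(row))  # [0, 0, 0, 255, 255]
```

With a cutoff of 80%, the light grey paper background turns white while ink and darker smudges turn black. Lowering the percentage makes fewer pixels black, which is the direction to go if the output looks too dark.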
Split.sh File
I included the split.sh file below. It splits a PDF into smaller PDFs based on the batch size (5 pages max per file). It gives each file a helpful name showing the included page range (e.g. pages 1-5, 6-10, 11-15 etc.)
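#!/usr/bin/env bash
# Split input.pdf into chunks of BATCH_SIZE pages for parallel processing.
BATCH_SIZE=5
# Read the page count from pdfinfo's "Pages:" line
TOTAL=$(pdfinfo input.pdf | grep -- '^Pages' | tr -dc '0-9')
mkdir -p pdfsplit
for i in $(seq 1 "$BATCH_SIZE" "$TOTAL")
do
  j=$((i + BATCH_SIZE - 1))
  # Clamp the last chunk so the file name shows the real page range
  if [ "$j" -gt "$TOTAL" ]; then j=$TOTAL; fi
  gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dFirstPage="$i" -dLastPage="$j" -sOutputFile="pdfsplit/output_$i-$j.pdf" input.pdf
done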
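If the page arithmetic in the loop is hard to follow, the same range calculation can be sketched in a few lines of Python. The function name is hypothetical, and the sketch clamps the final range to the total page count so that a 12-page file ends at 11-12.

```python
def batch_ranges(total_pages, batch_size=5):
    """Return (first, last) page pairs, clamping the final chunk."""
    return [
        (first, min(first + batch_size - 1, total_pages))
        for first in range(1, total_pages + 1, batch_size)
    ]


print(batch_ranges(12))  # [(1, 5), (6, 10), (11, 12)]
```

Each pair corresponds to one gs call with -dFirstPage and -dLastPage.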
How to clean up scanned PDFs using my tool
Time needed: 15 minutes.
- Download the repository
- Copy your PDF file to the root of the repository and rename it to `input.pdf`
- Split the PDF
Split the input PDF into multiple PDFs, each containing a small batch of pages (this enables parallelisation and limits memory use during batch processing).
Command: ./split.sh
Output: pdfsplit directory containing PDFs with 5 pages each
- Convert PDFs to PNG files
Convert each 5-page PDF into individual PNG files, one per page.
Command: parallel -j 7 convert -density 600 {} -set filename:f '%t_%p' out/'out-%[filename:f]_orig.png' ::: *pdfsplit/*
Output: `out` directory containing raw PNG files
- Use Python UI to remove scanning artifacts
Manually remove unwanted scanning artifacts (such as comb/spiral binding) in the top, bottom, left and right page margins. It is not necessary to crop white borders as that will be done automatically.
Command: python show.py
Output: cropargs.csv containing the individual crop settings for each page
- Crop and optimise PNG files using ImageMagick `convert` tool
Apply the crop and optimise the scanned page using ImageMagick’s convert tool. This command runs convert in parallel, applying the following operations:
1. Apply the manual crop to remove scanning artifacts
2. Convert the image to black and white by applying a threshold value (any pixel darker than 80% brightness is converted to black, matching -threshold 80% in the command)
3. Adjust image rotation to straighten the scanned page
4. Automatically crop the scanned page by removing white borders
Command: cat cropargs.csv | parallel -j 6 --colsep ',' convert -crop {1} +repage -threshold 80% -deskew 40% -trim {2} {3}
Output: *_final.png file for each scanned page
- Convert the final PNG files to PDFs
Command: ls out/*final.png | parallel -j 4 "convert -density 600 -page 'a4<' {} {}.pdf"
- Merge individual files using PDFsam
Use PDFsam to merge the PDFs into a single PDF. This is the most performant solution: using `convert` for the merge used 20+ GB of memory and stalled my system. An alternative would be to batch the PDF creation by creating multiple 10-page PDFs at a time using `parallel` and then combining those. I found simply downloading PDFsam for this task to be the easiest solution.
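For readers who prefer Python over GNU parallel, the crop-and-optimise step above could be sketched roughly as follows. This is an illustration only, not part of the repository: it assumes cropargs.csv has three columns (crop geometry, input file, output file), as implied by the {1} {2} {3} placeholders, and the function names are hypothetical. It mirrors the convert options from the command above without changing them.

```python
import csv
import io
import subprocess
from concurrent.futures import ThreadPoolExecutor


def convert_commands(csv_text, threshold="80%", deskew="40%"):
    """Build one convert argument list per CSV row (crop, input, output)."""
    commands = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        crop, src, dst = row
        commands.append([
            "convert", "-crop", crop, "+repage",
            "-threshold", threshold, "-deskew", deskew, "-trim",
            src, dst,
        ])
    return commands


def run_all(commands, jobs=2):
    """Run the commands with at most `jobs` convert processes at a time."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # list() waits for completion and re-raises the first failure, if any
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))


sample = "4660x6966+300+50,out/page_1_orig.png,out/page_1_final.png\n"
for command in convert_commands(sample):
    print(" ".join(command))
```

Calling run_all(convert_commands(open("cropargs.csv").read()), jobs=2) would then play the role of parallel -j 2, assuming ImageMagick is installed and on the PATH.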
ImageMagick Policies
I did have to tweak ImageMagick settings to increase memory and CPU usage and enable PDF editing. Simply uncomment or edit the following rows in /etc/ImageMagick-6/policy.xml. Adjust based on your system’s resources.
The commands run multiple instances of ImageMagick concurrently in order to speed up the processing.
<policy domain="resource" name="memory" value="2GiB"/>
<policy domain="resource" name="map" value="2GiB"/>
<policy domain="resource" name="disk" value="10GiB"/>
<!-- comment out the line below to enable PDF editing -->
<!-- <policy domain="coder" rights="none" pattern="PDF" /> -->
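If convert still refuses to read PDFs after editing the file, a quick way to check whether the PDF coder is blocked is to parse the policy XML. The following Python sketch does that for a small sample string; the function name is hypothetical and the sample assumes the usual policymap root element.

```python
import xml.etree.ElementTree as ET


def pdf_blocked(policy_xml):
    """Return True if a coder policy denies all rights for PDF."""
    root = ET.fromstring(policy_xml)
    for policy in root.iter("policy"):
        if (policy.get("domain") == "coder"
                and policy.get("rights") == "none"
                and "PDF" in policy.get("pattern", "")):
            return True
    return False


sample = """<policymap>
  <policy domain="resource" name="memory" value="2GiB"/>
  <!-- <policy domain="coder" rights="none" pattern="PDF" /> -->
</policymap>"""
print(pdf_blocked(sample))  # False
```

Commented-out policies are ignored by the parser, so the sample reports that PDF is not blocked. For a real file, ET.parse(path).getroot() can be used in place of ET.fromstring.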