I wrote a PDF scan clean up & optimisation tool using Bash, GNU Parallel, ImageMagick convert and Python Tkinter

Daniel Mason | Jun 10, 2023 | September 25, 2023

Clean up & optimise PDFs for e-readers, tablets and print (hundreds of pages in a batch): unrotate, apply a contrast threshold, crop artifacts, trim whitespace

Anybody working with scanned documents faces this issue: Scanned PDFs are often low contrast, contain scanning artifacts such as book comb bindings, and are skewed and rotated somehow. This not only makes them difficult to read on e-reader and tablet devices but also reduces the size of usable content on printed A4 pages.

Common problems with scanned documents

I have been printing many classical music pieces from IMSLP.com which offers a collection of sheet music in the public domain. This included graded sight-reading course books with hundreds of pages of PDF scans. These contain the following issues:


As you can see, these problem areas not only waste printer toner, they also reduce the quality of the printed material and cause eye strain and readability issues. For some time I had wanted to write a script to correct these problems, as it is fairly straightforward with image manipulation tools like ImageMagick and command-line tools like GNU parallel, which efficiently runs the same operation on hundreds of image files. Each problem area above can be solved with the corresponding step listed here:

Crop unwanted scanning artifacts such as book comb bindings by defining a crop margin using a GUI
Crop white space borders to increase the size of usable content on tablets/e-readers
Deskew crooked pages (straightens up the page and reduces unnecessary margins)
Increase contrast by applying a black threshold to remove grey background
Convert image to true black and white to save on printer ink/toner and increase readability
Sharpen text

Depending on the quality of original PDF, this allowed me to increase the size of music staves by 10.7% on the final printed page, while also increasing readability due to maximum contrast and saving printer toner by removing unwanted/distracting scanning artifacts.

YouTube Video

The example PDF file in the video is of terrible quality to begin with, and I wouldn’t normally bother with such a file (there are typically many alternative versions of the same piece on IMSLP.com). It’s just a matter of finding a better quality PDF scan and then applying the series of commands to get a final output that is high contrast, cropped, straightened and sharpened.

Main Benefits of this solution

The main differentiators of this solution:

Scalability: the process handles hundreds of pages well
Reduce waiting/processing time by using available CPU and memory resources optimally (parallel processing)
Completely local tool – no upload to online servers required
Supports different page layouts (e.g. facing pages as in a booklet, single page layouts, or any custom arrangement; see GUI options in next section)
Higher cropping accuracy because of the ability to manually tweak individual pages and apply settings in bulk.
Free of charge

The Python Tkinter Graphical User Interface

Now stating the obvious: I really did not spend much time on this UI. As soon as I got it working for the intended use case I stopped enhancing it. If there is enough interest I might clean it up. For now it offers basic but powerful functionality for manipulating PDF scans on any computer running Python.

Apply crop borders using arrow keys for top, bottom, left and right margins
Use SHIFT + arrow keys to reduce margins
Save time by applying the same crop margins to many files in bulk
- All pages, Even pages, Odd pages, Remaining pages in range, uncropped pages
Save crop settings to CSV file
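The bulk actions above can be sketched in a few lines of Python. This is a minimal illustration, not the code in show.py: the `Margins` class, the mode names and the convention that all-zero margins mean "uncropped" are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Margins:
    """Crop margins in pixels (hypothetical structure, not taken from show.py)."""
    top: int = 0
    bottom: int = 0
    left: int = 0
    right: int = 0

    def is_uncropped(self) -> bool:
        # Assumption: a page with all-zero margins has not been cropped yet.
        return (self.top, self.bottom, self.left, self.right) == (0, 0, 0, 0)


def select_pages(pages, mode, current=0):
    """Return the 0-based indices a bulk action should touch.

    Odd/even refer to 1-based page numbers, so index 0 is page 1 (odd).
    """
    indices = range(len(pages))
    if mode == "all":
        return list(indices)
    if mode == "odd":
        return [i for i in indices if (i + 1) % 2 == 1]
    if mode == "even":
        return [i for i in indices if (i + 1) % 2 == 0]
    if mode == "remaining":
        # Current page and everything after it.
        return [i for i in indices if i >= current]
    if mode == "uncropped":
        return [i for i in indices if pages[i].is_uncropped()]
    raise ValueError(f"unknown mode: {mode}")


def apply_bulk(pages, source, mode, current=0):
    """Copy the margins of `source` onto every selected page."""
    for i in select_pages(pages, mode, current):
        pages[i] = Margins(source.top, source.bottom, source.left, source.right)
    return pages
```

For a booklet scan with facing pages, one would typically tune a single left-hand page, apply it to the even pages, then repeat for the odd pages.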

Now, it might surprise you to hear me (an automation addict) say this, but for me it is worth spending 15 minutes manually tweaking a file if that means the final printed booklet is easier to read, cheaper to print and good-looking in general. I am planning to spend a lot of time learning, playing and enjoying these music pieces. Compared to that, the little bit of time spent cleaning the documents beforehand is negligible.


Using Bash Unix tools for this purpose

Part of my interest in this exercise is to play with the parallel command and learn about ImageMagick’s image manipulation options. Here is a table of the commands used and their purpose.

Command	Purpose	Instruction
split.sh	Bash script to determine the number of pages in your input PDF and split it into smaller chunks (5 pages each) determined by the BATCH_SIZE variable at the top of the file. This is to help us optimise the batch run later using `parallel` to have multiple chunks processed at the same time.	You can edit the BATCH_SIZE variable to adjust based on your system resources. I found 5 pages per batch to be sufficient.
gs	Used for PDF operations (splitting a file)
convert	We use ImageMagick’s convert command to do all the PDF to PNG conversion and image manipulation (cropping, deskewing, unrotating, thresholding, sharpening, trimming and converting back to PDF). I carefully tested and tuned the input parameters to get the best outcome.	If your PDFs come out too dark, try adjusting the -threshold parameter. I found 80% to be a good value: any pixel darker than the 80% level is set to pure black and any pixel lighter than it is set to pure white. This removes the grey background in the scanned image and increases contrast.
parallel	GNU `parallel` is a very useful command for parallelising multiple instances of other commands. We use it to run multiple instances of the aforementioned `convert` command concurrently to make better use of system resources and speed up the conversion process.	The `-j` parameter determines the maximum number of subprocesses to use. My server has a 6 core/12 thread CPU and I could run 6 or 7 subprocesses depending on the task. To avoid stalling your machine, I suggest starting with a value of 2.
python show.py	Custom Python GUI which loads all the PNG files in the `out` directory and provides a graphical interface for setting crop borders.	Use the arrow keys to increase crop borders on left, right, top and bottom. To reduce a border, hold down shift and press the corresponding arrow key. You can apply the same settings to multiple pages using the provided buttons. The text fields are not editable. When you are done removing scanning artifacts from the images, press the save to file button. This saves all your settings to a file called `cropargs.csv`.
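To make the -threshold behaviour concrete, here is a small pure-Python illustration on a grid of 8-bit greyscale values. It is a sketch of the idea only, not ImageMagick's implementation; the function name and the 0 to 255 value range are assumptions for the example.

```python
def threshold(pixels, percent=80, max_value=255):
    """Map every pixel to pure black (0) or pure white (max_value).

    Values strictly above the cutoff become white, everything else black,
    so a light grey scan background is pushed to white.
    """
    cutoff = max_value * percent / 100
    return [[max_value if p > cutoff else 0 for p in row] for row in pixels]


# One row of a scan: dark ink, mid grey, and a light grey background.
row = [[12, 120, 204, 205, 240]]
print(threshold(row))
```

Lowering the percentage turns fewer pixels black, which is the direction to try when the output comes out too dark.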


Split.sh File

I included the split.sh file below. It splits a PDF into smaller PDFs based on the batch size (5 pages max per file) and gives each file a helpful name showing the included page range (e.g. pages 1-5, 6-10, 11-15 etc.).

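#!/usr/bin/env bash
# Split input.pdf into chunks of BATCH_SIZE pages for parallel processing.
BATCH_SIZE=5
# pdfinfo prints a line like "Pages:   123"; keep only the digits.
TOTAL=$(pdfinfo input.pdf | grep -- '^Pages' | tr -dc '0-9')

mkdir -p pdfsplit  # make sure the output directory exists

for i in $(seq 1 "$BATCH_SIZE" "$TOTAL")
do
    j=$((i + BATCH_SIZE - 1))
    # Clamp the last chunk so the file name shows the real page range.
    if [ "$j" -gt "$TOTAL" ]; then j=$TOTAL; fi

    gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dFirstPage="$i" -dLastPage="$j" -sOutputFile="pdfsplit/output_$i-$j.pdf" input.pdf
done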
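The page-range arithmetic of the loop is easy to check in isolation. The following is a minimal Python sketch of the same calculation; the function name is made up for the example, and clamping the last range to the total page count is a choice of this sketch.

```python
def page_ranges(total, batch_size=5):
    """Return (first, last) page pairs, 1-based and inclusive.

    Mirrors `seq 1 BATCH_SIZE TOTAL` with last = first + batch_size - 1,
    except that the final range is clamped to the total page count.
    """
    return [
        (first, min(first + batch_size - 1, total))
        for first in range(1, total + 1, batch_size)
    ]


for first, last in page_ranges(13):
    print(f"pdfsplit/output_{first}-{last}.pdf")
```

For a 13-page PDF this yields three chunks, the last one holding pages 11 to 13.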

How to clean up scanned PDFs using my tool

Time needed: 15 minutes.

Download the repository
Copy your PDF file to the root of the repository and rename it to `input.pdf`
Split the PDF
Split the input PDF into multiple PDFs each containing a small batch of pages (This is for parallelisation and limited memory use during batch processing)
Command: ./split.sh
Output: pdfsplit directory containing PDFs with 5 pages each
Convert PDFs to PNG files
Split each PDF into individual PNG files, one per page (up to 5 per PDF)
Command:

parallel -j 7 convert -density 600 {} -set filename:f '%t_%p' out/'out-%[filename:f]_orig.png' ::: *pdfsplit/*

Output: `out` directory containing raw PNG files
Use Python UI to remove scanning artifacts
Manually remove unwanted scanning artifacts (such as comb/spiral binding) in the top/bottom/left/right page margins. It is not necessary to crop white borders as that will be done automatically.
Command: python show.py
Output: cropargs.csv – containing the individual crop settings for each page
Crop and Optimise PNG files using ImageMagick `convert` tool

Apply the crop and optimise the scanned page using ImageMagick’s convert tool.
This command runs convert in parallel, applying the following operations
1. Apply the manual crop to remove scanning artifacts
2. Convert the image to black and white by applying a threshold value (any pixel darker than the 80% level is converted to black, anything lighter to white)
3. Adjust image rotation to straighten the scanned page
4. Automatically crop the scanned page by removing white borders
Command:

cat cropargs.csv | parallel -j 6 --colsep ',' convert -crop {1} +repage -threshold 80% -deskew 40% -trim {2} {3}

Output: *_final.png file for each scanned page
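For readers who want to see what one row of this step amounts to, here is a hedged Python sketch. It assumes the three CSV columns are, in order, a crop geometry, the input PNG and the output PNG; that reading comes from the `{1} {2} {3}` placeholders in the command above and is not confirmed by the file itself. Both function names are made up, and the conversion from margins to a `WxH+X+Y` geometry is an assumption about what the GUI stores.

```python
def crop_geometry(width, height, top, bottom, left, right):
    """Turn pixel margins into an ImageMagick-style WxH+X+Y geometry string."""
    w = width - left - right
    h = height - top - bottom
    if w <= 0 or h <= 0:
        raise ValueError("margins remove the whole page")
    # The offset is the top-left corner of the kept region.
    return f"{w}x{h}+{left}+{top}"


def convert_args(geometry, src, dst, threshold="80%", deskew="40%"):
    """Build the argument list in the same order as the command above."""
    return [
        "convert",
        "-crop", geometry, "+repage",   # manual crop, then reset the canvas
        "-threshold", threshold,        # pure black and white
        "-deskew", deskew,              # straighten the page
        "-trim",                        # drop the remaining white border
        src, dst,
    ]


geom = crop_geometry(4960, 7016, top=100, bottom=50, left=200, right=0)
print(" ".join(convert_args(geom, "in.png", "in_final.png")))
```

Passing the list to `subprocess.run` would execute it; the sketch only prints the command so it can be inspected first.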
Convert the final PNG files to PDFs

Command:

ls out/*final.png | parallel -j 4 "convert -density 600 -page 'a4<' {} {}.pdf"
Merge individual files using PDFsam

Use PDFsam to merge the PDFs into a single PDF. This is the most performant solution: using `convert` for the merge used 20+GB of memory and stalled my system. An alternative would be to batch the PDF creation by creating multiple 10-page PDFs at a time using `parallel` and then combining those. I found simply downloading PDFsam for this task the easiest solution.
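If you try the batched alternative, the pages have to be grouped in page order first, and a plain alphabetical sort puts "11-15" before "6-10". A minimal sketch of grouping with a numeric-aware sort follows; the file names are hypothetical examples and the helper names are invented for the illustration.

```python
import re


def natural_key(name):
    """Sort key that compares embedded numbers by value, not by character."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


def batches(files, size=10):
    """Order the files naturally, then cut them into groups of `size`."""
    ordered = sorted(files, key=natural_key)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Hypothetical file names; the real ones depend on the earlier steps.
files = [
    "out-output_11-15_0_final.png.pdf",
    "out-output_6-10_0_final.png.pdf",
    "out-output_1-5_0_final.png.pdf",
]
for group in batches(files, size=2):
    print(group)
```

Each group could then be merged into one intermediate PDF before the final merge.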

ImageMagick Policies

I did have to tweak ImageMagick settings to increase memory and CPU usage and enable PDF editing. Simply uncomment or edit the following rows in /etc/ImageMagick-6/policy.xml. Adjust based on your system’s resources.

The commands run multiple instances of ImageMagick concurrently in order to speed up the processing.

<policy domain="resource" name="memory" value="2GiB"/>
<policy domain="resource" name="map" value="2GiB"/>
<policy domain="resource" name="disk" value="10GiB"/>
<!-- comment out the line below to enable PDF editing -->
<!-- <policy domain="coder" rights="none" pattern="PDF" /> -->
