marii¨ | using ghostscript to covert washed-out pdfs for ocr processing

Using Ghostscript to Covert Washed-Out PDFs for OCR Processing

13 July 2020

I was given a batch of PDFs from a legacy journal site. The PDFs varied in length from 1 to 5 pages, included poetry and criticism (sometimes in multiple languages), and were low in contrast—i.e., the text was medium grey on a lighter grey background. Here’s a sample, for reference:

excerpt for greyscale reference

I wanted to OCR the pdfs, but when I ran them through my general-purpose pipeline (using vanilla Ghostscript to convert the PDFs to TIFs, then using Tesseract to extract text from the TIFs), the text was garbáge.

I looked at the generated TIFs to try and see if the problem was on the OCR side or the format conversion side. As you can see from the sample below, Tesseract never had a chance. The problem was with the conversion to TIF.

excerpt for greyscale reference

It took quite some time digging through Stackoverflow questions and reading Ghostscript(GS) docs before I found a working solution. There are so many ways to configure and optimize GS—it can get really overwhelming. So I thought I’d share and dissect the many parameters behind the command that eventually worked for me.

Without further ado, here’s the special sauce:

gs \
  -q \
  -dNOPAUSE \
  -dBATCH \
  -sDEVICE=pnggray \
  -g2550x3300 \
  -sOutputFile=$png_dir/%03d.png \
  $pdf

Breaking it down line by line:

gs: the ghostscript invocation name
-q: runs the command in 'quiet mode'; optional for our purposes
-dNOPAUSE: disables the prompt and pause at the end of each page. (see more here.) not relevant to our problem, but you're not going to want to interact with a prompt every time you convert a file in a large batch. so you'll want to add this.
-dBATCH: related to -dNOPAUSE. if you're running this as part of a script, just use it. trust me.
-sDEVICE=pnggray: sets which output device Ghostscript should use, in this case, pnggrey. (see more here and here.)
-g2550x3300: specifies the device width and height in pixels. you'll want to adjust these for your purposes and play around as necessary. (see more here.)
-sOutputFile=$png_dir/%03d.png: specifies which path(s) to write the resulting files to. in my case, a shell script uses a variable to $png_dir to streamline the placement. The file ends in .png because that's the conversion device i'm using and the result i want. %03d uses ghostscript's %d syntax for multipage documents, and specifically creates 001.png, 002.png, etc. as needed depending on the number of pages in the original PDF document. (the 03 specifies digit padding with zeroes)
$pdf: this is (absolute or relative path) to the path to the file you want to convert. in my case, $pdf is the variable for the path to each PDF file as globbed by my shell script. you can similarly use a variable or write out the path.