I was given a batch of PDFs from a legacy journal site. The PDFs varied in length from 1 to 5 pages, included poetry and criticism (sometimes in multiple languages), and were low in contrast—i.e., the text was medium grey on a lighter grey background. Here’s a sample, for reference:
I wanted to OCR the pdfs, but when I ran them through my general-purpose pipeline (using vanilla Ghostscript to convert the PDFs to TIFs, then using Tesseract to extract text from the TIFs), the text was garbáge.
I looked at the generated TIFs to try and see if the problem was on the OCR side or the format conversion side. As you can see from the sample below, Tesseract never had a chance. The problem was with the conversion to TIF.
It took quite some time digging through Stackoverflow questions and reading Ghostscript(GS) docs before I found a working solution. There are so many ways to configure and optimize GS—it can get really overwhelming. So I thought I’d share and dissect the many parameters behind the command that eventually worked for me.
Without further ado, here’s the special sauce:
gs \ -q \ -dNOPAUSE \ -dBATCH \ -sDEVICE=pnggray \ -g2550x3300 \ -sOutputFile=$png_dir/%03d.png \ $pdf
Breaking it down line by line:
- the ghostscript invocation name
- runs the command in 'quiet mode'; optional for our purposes
- disables the prompt and pause at the end of each page. (see more here.) not relevant to our problem, but you're not going to want to interact with a prompt every time you convert a file in a large batch. so you'll want to add this.
- related to
-dNOPAUSE. if you're running this as part of a script, just use it. trust me.
- sets which output device Ghostscript should use, in this case,
pnggrey. (see more here and here.)
- specifies the device width and height in pixels. you'll want to adjust these for your purposes and play around as necessary. (see more here.)
- specifies which path(s) to write the resulting files to. in my case, a shell script uses a variable to
$png_dirto streamline the placement. The file ends in
.pngbecause that's the conversion device i'm using and the result i want.
%dsyntax for multipage documents, and specifically creates 001.png, 002.png, etc. as needed depending on the number of pages in the original PDF document. (the
03specifies digit padding with zeroes)
- this is (absolute or relative path) to the path to the file you want to convert. in my case,