Convert PDF to text with pyPDF and PDFMiner: First Impression

This quick post describes my initial experience with pyPDF and PDFMiner.

I’ve previously mentioned pyPDF in a post on counting the number of pages of each PDF in a directory full of PDFs. I have not used PDFMiner before, but saw it referenced many times when searching for PDF-to-text conversion Python libraries. From a glance at their documentation, pyPDF looked better suited to PDF manipulation, and PDFMiner to text extraction.

Out of curiosity, I wanted to try both of them out for text extraction.

pyPDF

According to its documentation, pyPDF includes a text extraction method called extractText() in its PageObject class:

extractText()
Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

Stability: Added in v1.7, will exist for all future v1.x releases. May be overhauled to provide more ordered text in the future.

Returns:
a unicode string object

There are no code examples or samples for extractText() in the pyPDF documentation.

The most basic way to extract text using pyPDF’s extractText() is:

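import os
from pyPdf import PdfFileReader

# Work from the directory that holds the sample PDFs
parent = "D:/Projects/samples/extract-pdf-text"
os.chdir(parent)
filename = os.path.abspath('simple1.pdf')

# Open in binary mode; "reader" avoids shadowing the built-in input()
reader = PdfFileReader(open(filename, "rb"))
for page in reader.pages:
    # Python 2 print statement
    print page.extractText()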

The PDF in this example is located at “D:\Projects\extract-pdf-text\samples\simple1.pdf” and essentially reads “Hello World.” This sample PDF is included with PDFMiner’s source, along with “simple2.pdf” (embedded images) and “simple3.pdf” (no visible objects in PDF). For each of simple1.pdf, simple2.pdf, and simple3.pdf, this code snippet raises the error “ValueError: invalid literal for int() with base 10: '>>'” while trying to find the start of the PDF’s xref table.

I tried a few more PDFMiner samples with this snippet. With “dmca.pdf” (a secured read-only PDF), pyPDF raises the error “Exception: file has not been decrypted”. With “naacl06-shinyama.pdf” (a typical research publication article), pyPDF raises the error “UnicodeEncodeError: 'ascii' codec can't encode character u'\ufb01' in position 1933: ordinal not in range(128)”.
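The “file has not been decrypted” error suggests calling decrypt() before reading any pages. Here is a minimal sketch of that idea, assuming the reader object behaves like pyPdf’s PdfFileReader (isEncrypted, decrypt(), pages) and that the file accepts an empty user password, which may well not hold for dmca.pdf. The helper name extract_text_pages is hypothetical and not part of pyPdf.

```python
def extract_text_pages(reader, password=""):
    """Return the text of every page as a list of strings.

    `reader` is assumed to behave like pyPdf's PdfFileReader: it exposes
    `isEncrypted`, `decrypt(password)` and `pages`.
    """
    if reader.isEncrypted:
        # decrypt() returns 0 when the password does not match
        if reader.decrypt(password) == 0:
            raise ValueError("could not decrypt PDF with the given password")
    return [page.extractText() for page in reader.pages]

# Possible usage (requires pyPdf, so it is left commented out here):
# from pyPdf import PdfFileReader
# pages = extract_text_pages(PdfFileReader(open("dmca.pdf", "rb")))
```

Passing the reader in, rather than a path, keeps the helper independent of how the file is opened.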

ActiveState recipe #511465 defines a wrapper function for this for loop:

### Reference
# 1. http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-converter/

import os
import glob
import pyPdf

parent = "C:/Users/victoryee/Google Drive/Projects/extract-pdf-text"
os.chdir(parent)
filename = os.path.abspath('naacl06-shinyama.pdf')

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace (non-breaking spaces become plain spaces first)
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

# print getPDFContent(filename).encode("ascii", "ignore")
print getPDFContent(filename).encode("ascii", "xmlcharrefreplace")

The output from this code differs little from that of the previous snippet. The main change is the call to encode() at the end, which avoids the UnicodeEncodeError when processing “naacl06-shinyama.pdf”.
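To see what the whitespace collapsing and the two encode() error handlers do, here is a small sketch on a made-up string (not taken from the PDF) that contains the same u'\ufb01' ligature named in the error message. It needs no PDF library.

```python
# Made-up sample: double space, "fi" ligature (u"\ufb01"),
# a non-breaking space (u"\xa0") and a newline
sample = u"We  de\ufb01ne\xa0a\nscenario"

# Turn non-breaking spaces into plain spaces, then collapse every
# run of whitespace into a single space
collapsed = u" ".join(sample.replace(u"\xa0", u" ").split())

# "ignore" silently drops characters that ASCII cannot represent
dropped = collapsed.encode("ascii", "ignore")

# "xmlcharrefreplace" keeps them as numeric character references
escaped = collapsed.encode("ascii", "xmlcharrefreplace")

print(dropped)   # the ligature is gone: "We dene a scenario"
print(escaped)   # the ligature becomes "&#64257;"
```

The "xmlcharrefreplace" handler loses no information, but the result is only readable once something interprets the character references, such as a browser.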

In any case, pyPDF is not layout-aware, so the spaces between words are lost and the output looks like this:

PreemptiveInformationExtractionusingUnrestrictedRelationDiscoveryYusukeShinyamaSatoshiSekineNewYorkUniversity715,Broadway,7thFloorNewYork,NY,10003fyusuke,sekineg@cs.nyu.eduAbstractWearetryingtoextendtheboundaryofInformationExtraction(IE)systems.Ex-istingIEsystemsrequirealotoftimeandhumanefforttotuneforanewscenario.

PDFMiner

According to PDFMiner’s webpage,

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.

The only command needed to use PDFMiner out of the box is pdf2txt.py. Its documentation includes these command-line examples:

$ pdf2txt.py -o output.html samples/naacl06-shinyama.pdf
(extract text as an HTML file whose filename is output.html)

$ pdf2txt.py -V -c euc-jp -o output.html samples/jo.pdf
(extract a Japanese HTML file in vertical writing, CMap is required)

$ pdf2txt.py -P mypassword -o output.txt secret.pdf
(extract a text from an encrypted PDF file)

I liked that the source came with several sample PDFs, as well as their respective text, XML, and HTML conversions. I was able to reproduce all the samples using pdf2txt.py on the command line, except for “jo.pdf” and “kampo.pdf”, both of which used Asian script letters. (In fact, my Adobe PDF Reader couldn’t read these files natively without additional font package installation.)
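The pdf2txt.py calls from the documentation can also be scripted from Python. Here is a minimal sketch that uses only the -o and -P options quoted above. It assumes pdf2txt.py is on the PATH and directly executable (on Windows it may need to be launched through the Python interpreter instead). The helper names build_pdf2txt_command and convert are hypothetical and not part of PDFMiner.

```python
import subprocess

def build_pdf2txt_command(pdf_path, output_path, password=None):
    """Build the argument list for a pdf2txt.py call."""
    cmd = ["pdf2txt.py", "-o", output_path]
    if password is not None:
        # -P supplies the password for an encrypted PDF
        cmd += ["-P", password]
    cmd.append(pdf_path)
    return cmd

def convert(pdf_path, output_path, password=None):
    """Run pdf2txt.py; raises CalledProcessError on a non-zero exit code."""
    subprocess.check_call(build_pdf2txt_command(pdf_path, output_path, password))

# Possible usage (left commented out because it needs PDFMiner installed):
# convert("samples/naacl06-shinyama.pdf", "output.html")
```

Building the argument list in its own function makes it easy to inspect the command before anything is executed.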

Impressions

As expected, pyPDF (and maybe its fork PyPDF2) is better for PDF manipulation, and PDFMiner is better for text conversion. There are also other PDF-to-text conversion or extraction tools, such as PDFBox (Java), that I might try in the future.

 
