Get number of pages from multiple PDFs in a directory, with Python

Objective: Get the number of pages for each PDF, in a directory of PDFs

Notes: I recently received a directory of over a hundred PDFs, each of which was supposed to be at most a certain number of pages long. Opening each PDF and jotting down its page count by hand would be tedious and time-consuming.

The first solution I tried was a utility from tiff-tools.com that I found through a quick Google search. It did the job for about two-thirds of my files, but output a page count of -1 for the remaining third. I'm not sure exactly why: I suspected that the utility was too old to support some PDF versions, but many of my PDFs were version 1.5 (a version that predates the utility).

Another Google search for Python code turned up snippet #496837 at ActiveState Recipes, which does the job for one PDF.

Some minor modifications allow this snippet to output the number of pages for multiple PDFs:

### References:
# http://code.activestate.com/recipes/496837-count-pdf-pages/
# http://stackoverflow.com/questions/3249949/how-to-print-a-string-of-variables-without-spaces-in-python-minimal-coding

import re
import os
import glob

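# Match "/Type /Page" entries, but not the "/Type /Pages" tree nodes.
rxcountpages = re.compile(r"/Type\s*/Page([^s]|$)", re.MULTILINE|re.DOTALL)

def count_pages(filename):
    # Read the raw bytes and count the page-object markers.
    with open(filename, "rb") as f:
        data = f.read()
    return len(rxcountpages.findall(data))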

if __name__=="__main__":
    parent = "D:/2012-2013 Proposals"
    for infile in glob.glob(os.path.join(parent, '*.pdf')):
        page_count = str(count_pages(infile))
        print "".join([infile, ",", page_count])

Note that print infile, ",", page_count also works in the final line of the script, but Python will insert a space between the elements. Using print "".join([infile, ",", page_count]) eliminates these extra spaces.
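The scripts in this post are written for Python 2, where print is a statement. As a rough sketch of the same spacing behavior under Python 3 (an assumption, since the original targets Python 2), the example below uses a made-up filename and page count, and a hypothetical helper named csv_line:

```python
# Illustrative values only; not taken from the original directory.
infile = "proposal-01.pdf"
page_count = str(12)

def csv_line(name, count):
    """Join the two fields with a bare comma and no surrounding spaces."""
    return ",".join([name, count])

# print() separates its arguments with a space by default ...
print(infile, ",", page_count)            # proposal-01.pdf , 12
# ... unless sep is overridden, or the string is built before printing.
print(infile, ",", page_count, sep="")    # proposal-01.pdf,12
print(csv_line(infile, page_count))       # proposal-01.pdf,12
```

Building the string first keeps the output format independent of how print happens to separate its arguments.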

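To avoid outputting the entire absolute filepath for each PDF, change the working directory to parent with os.chdir, then glob for a relative pattern instead of os.path.join(parent, '*.pdf'). Note that os.path.basename('*.pdf') simply evaluates to '*.pdf', so the pattern is matched against the current directory: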

### References:
# http://code.activestate.com/recipes/496837-count-pdf-pages/
# http://stackoverflow.com/questions/3249949/how-to-print-a-string-of-variables-without-spaces-in-python-minimal-coding

import re
import os
import glob

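# Match "/Type /Page" entries, but not the "/Type /Pages" tree nodes.
rxcountpages = re.compile(r"/Type\s*/Page([^s]|$)", re.MULTILINE|re.DOTALL)

def count_pages(filename):
    # Read the raw bytes and count the page-object markers.
    with open(filename, "rb") as f:
        data = f.read()
    return len(rxcountpages.findall(data))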

if __name__=="__main__":
    parent = "D:/2012-2013 Proposals"
    os.chdir(parent)
    for infile in glob.glob(os.path.basename('*.pdf')):
        page_count = str(count_pages(infile))
        print "".join([infile, ",", page_count])
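The script above runs under Python 2 only (file() and the print statement no longer exist in Python 3). A minimal sketch of the same regex approach for Python 3 might look like the following. The names PAGE_RE and count_pages_in_dir are my own, not from the ActiveState recipe, and the main change is that the pattern must be a bytes pattern because the file is read in binary mode:

```python
import glob
import os
import re

# Bytes pattern: a file opened with "rb" yields bytes in Python 3, so the
# regex has to be bytes as well. It matches "/Type /Page" but not "/Type /Pages".
PAGE_RE = re.compile(rb"/Type\s*/Page([^s]|$)", re.MULTILINE | re.DOTALL)

def count_pages(path):
    """Count page-object markers in the raw bytes of one PDF."""
    with open(path, "rb") as f:
        return len(PAGE_RE.findall(f.read()))

def count_pages_in_dir(directory):
    """Return (basename, page_count) pairs for every *.pdf in directory."""
    pattern = os.path.join(directory, "*.pdf")
    # basename() strips the directory, so no chdir is needed for short names.
    return [(os.path.basename(p), count_pages(p))
            for p in sorted(glob.glob(pattern))]
```

Looping over count_pages_in_dir("D:/2012-2013 Proposals") and calling print(name, pages, sep=",") for each pair should give the same two-column output as the script above.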

An alternative method exists with the pyPdf library:

### References
# 1. http://pybrary.net/pyPdf/

import os
import glob
from pyPdf import PdfFileReader

parent = "D:/2012-2013 Proposals"
os.chdir(parent)
for infile in glob.glob(os.path.basename('*.pdf')):
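    # Keep the file open while pyPdf reads the page tree; the variable is
    # named reader to avoid shadowing the built-in input().
    with open(infile, "rb") as f:
        reader = PdfFileReader(f)
        page_count = str(reader.getNumPages())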
    print "".join([infile, ",", page_count])

The pyPdf library (which is mentioned in the ActiveState recipe) appears to be a fairly powerful library for manipulating PDFs, but it is also much slower than the regular-expression approach.
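pyPdf itself is no longer maintained; its lineage continues as the pypdf package. A rough sketch of the same page count with that package follows, assuming Python 3 and that pypdf is installed (for example with pip install pypdf). The function name count_pages_pypdf is hypothetical. A parsing library is also the safer choice when the regex count looks too low, since PDF 1.5 and later may store page objects inside compressed object streams that a plain byte search cannot see.

```python
import glob
import os

def count_pages_pypdf(path):
    """Return the page count that pypdf reports for one PDF file."""
    # Imported inside the function so the rest of a script still loads
    # on machines where pypdf is not installed (assumed third-party package).
    from pypdf import PdfReader
    return len(PdfReader(path).pages)

def count_dir_pypdf(directory):
    """Return (basename, page_count) pairs for every *.pdf in directory."""
    pattern = os.path.join(directory, "*.pdf")
    return [(os.path.basename(p), count_pages_pypdf(p))
            for p in sorted(glob.glob(pattern))]
```

The trade-off is the same as with pyPdf: the library parses the document structure rather than scanning bytes, so it is likely to be slower but less easily fooled.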

Usage

Finally, to write the output to a CSV file, run the script from the command line and redirect its output:

python count-pdf-pages-directory-nofullpath.py > pdf-pages.csv
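Redirecting the joined strings works as long as no filename contains a comma. If that cannot be guaranteed, one option is to let Python's csv module do the quoting. The sketch below assumes Python 3 and a list of (filename, page_count) pairs; write_counts is a hypothetical helper name, not part of the original script:

```python
import csv
import sys

def write_counts(rows, out=None):
    """Write (filename, page_count) rows as CSV, quoting fields where needed."""
    if out is None:
        out = sys.stdout  # still works with "> pdf-pages.csv" redirection
    # lineterminator="\n" avoids doubled line endings when stdout is redirected.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(rows)

# Illustrative rows only; the first filename contains a comma on purpose.
write_counts([("budget, final.pdf", 3), ("summary.pdf", 10)])
```

A filename with a comma is wrapped in double quotes, so a spreadsheet program still reads it as a single column.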
