Objective: Get the number of pages for each PDF in a directory of PDFs
Notes: I recently received a directory of over a hundred PDFs, each of which was required to be no longer than a certain number of pages. Opening each PDF and jotting down its page count by hand would be tedious and time-consuming.
The first solution I tried was a utility from tiff-tools.com that I found through a quick Google search. It did the job for about 2/3 of my files, but output a page count of -1 for the remaining 1/3. I'm not sure exactly why. I suspected that the utility was too old to support some PDF versions, but many of my PDFs were version 1.5, which is older than the utility, so that does not seem to be the cause.
Another Google search for Python code turned up snippet #496837 at ActiveState Recipes, which does the job for one PDF.
Some minor modifications allow this snippet to output the number of pages for multiple PDFs:
    ### References:
    # http://code.activestate.com/recipes/496837-count-pdf-pages/
    # http://stackoverflow.com/questions/3249949/how-to-print-a-string-of-variables-without-spaces-in-python-minimal-coding
    import re
    import os
    import glob

    rxcountpages = re.compile(r"/Type\s*/Page([^s]|$)", re.MULTILINE|re.DOTALL)

    def count_pages(filename):
        data = file(filename,"rb").read()
        return len(rxcountpages.findall(data))

    if __name__=="__main__":
        parent = "D:/2012-2013 Proposals"
        for infile in glob.glob(os.path.join(parent, '*.pdf')):
            page_count = str(count_pages(infile))
            print "".join([infile, ",", page_count])
Note that
    print infile, ",", page_count
also works on line 18 (the final print line), but Python inserts a space between the printed elements. Using
    print "".join([infile, ",", page_count])
eliminates these extra spaces.
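For readers on Python 3 (an assumption; the scripts in this post use Python 2 syntax), the same spacing behavior can be sketched with the sep argument of print(). The file name and page count below are made-up values for illustration only.

```python
# Hypothetical values, for illustration only.
infile = "proposal-01.pdf"
page_count = str(12)

# Default separator: print() puts a single space between its arguments.
spaced = " ".join([infile, ",", page_count])
print(infile, ",", page_count)          # same text as `spaced`

# Joining the pieces first, or passing sep="", avoids the extra spaces.
joined = "".join([infile, ",", page_count])
print(joined)
print(infile, ",", page_count, sep="")  # same text as `joined`
```

Either form gives a clean comma-separated row; joining first has the advantage of working unchanged in both Python 2 and Python 3 once print is written as a function call.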
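To avoid printing the full absolute path of each PDF, the os.path.join(parent, '*.pdf') call can be replaced with a line that changes the working directory to parent, followed by a glob on os.path.basename('*.pdf'). Here basename simply returns '*.pdf', so glob.glob('*.pdf') would be equivalent: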
    ### Reference:
    # http://code.activestate.com/recipes/496837-count-pdf-pages/
    # http://stackoverflow.com/questions/3249949/how-to-print-a-string-of-variables-without-spaces-in-python-minimal-coding
    import re
    import os
    import glob

    rxcountpages = re.compile(r"/Type\s*/Page([^s]|$)", re.MULTILINE|re.DOTALL)

    def count_pages(filename):
        data = file(filename,"rb").read()
        return len(rxcountpages.findall(data))

    if __name__=="__main__":
        parent = "D:/2012-2013 Proposals"
        os.chdir(parent)
        for infile in glob.glob(os.path.basename('*.pdf')):
            page_count = str(count_pages(infile))
            print "".join([infile, ",", page_count])
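The scripts above are written for Python 2 (the file() builtin and the print statement). A rough Python 3 sketch of the same regular-expression approach might look like the following. It is an adaptation under stated assumptions, not the original recipe: the pattern is compiled as bytes because the file is read in binary mode, page_counts is a hypothetical helper name, and os.path.basename is applied to each match instead of changing the working directory.

```python
import glob
import os
import re

# Bytes pattern: the file is read in binary mode, so the regex must be bytes too.
# It matches "/Type /Page" but not "/Type /Pages" (the page-tree node).
RX_COUNT_PAGES = re.compile(rb"/Type\s*/Page([^s]|$)", re.MULTILINE | re.DOTALL)


def count_pages(filename):
    """Count page objects by scanning the raw bytes of one PDF."""
    with open(filename, "rb") as handle:
        return len(RX_COUNT_PAGES.findall(handle.read()))


def page_counts(parent):
    """Return (file name, page count) pairs for every PDF directly in parent."""
    paths = sorted(glob.glob(os.path.join(parent, "*.pdf")))
    # basename() drops the directory part, so no os.chdir() is needed.
    return [(os.path.basename(path), count_pages(path)) for path in paths]


if __name__ == "__main__":
    # Same directory as in the scripts above.
    for name, pages in page_counts("D:/2012-2013 Proposals"):
        print("".join([name, ",", str(pages)]))
```

Keeping the working directory unchanged means the script has no side effects on later relative paths, at the cost of one basename call per file.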
An alternative method exists with the pyPdf library:
    ### References
    # 1. http://pybrary.net/pyPdf/
    import os
    import glob
    from pyPdf import PdfFileReader

    parent = "D:/2012-2013 Proposals"
    os.chdir(parent)
    for infile in glob.glob(os.path.basename('*.pdf')):
        input = PdfFileReader(file(infile, "rb"))
        page_count = str(input.getNumPages())
        print "".join([infile, ",", page_count])
The pyPdf library (which is mentioned in the ActiveState recipe) appears to be a fairly powerful library for manipulating PDFs, but it is also much slower than the regular-expression approach.
Usage
Finally, to save the output as a CSV file, run the script at the command line and redirect its output:
python count-pdf-pages-directory-nofullpath.py > pdf-pages.csv
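Joining the fields with a bare comma breaks down if a file name itself contains a comma. As a hedged alternative to hand-built rows, they could be written with Python's csv module, which quotes such fields. This is a Python 3 sketch; write_page_counts and the sample rows are hypothetical names and data, not part of the scripts above.

```python
import csv
import io


def write_page_counts(rows, stream):
    """Write (file name, page count) pairs as CSV rows to an open text stream."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerows(rows)


# Made-up rows; in practice they would come from the counting loop.
sample_rows = [("proposal-01.pdf", 12), ("budget, revised.pdf", 3)]

buffer = io.StringIO()
write_page_counts(sample_rows, buffer)
print(buffer.getvalue(), end="")
```

To write straight to a file instead of an in-memory buffer, pass a stream opened with open("pdf-pages.csv", "w", newline=""), which is how the csv documentation recommends opening output files.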
One Comment
lovely, im using the first version.
is there a way to look inside subfolders as well?
i have multiple folders.
thanks