Saturday, August 19, 2006

Extract PDF titles from all files in a directory

Got a directory full of PDF files whose file names have nothing to do with their titles, and want to generate a text listing?

Try this Python script. You need to have pyPdf installed.


# pyPdf available at http://pybrary.net/pyPdf/
from pyPdf import PdfFileWriter, PdfFileReader
import os

for fileName in os.listdir('.'):
    try:
        if fileName.lower()[-3:] != "pdf": continue
        input1 = PdfFileReader(file(fileName, "rb"))

        # print the title of the current PDF file
        print '##1', fileName, '##2', input1.getDocumentInfo().title
    except:
        print '##1', fileName, '##2'


Example output:


##1 00317565.pdf ##2 A framework for the specification of SCADA data links - Power Systems, IEEE Transactions on
##1 00363299.pdf ##2 Advanced SCADA concepts - IEEE Computer Applications in Power
##1 00392026.pdf ##2 Routing SCADA data through an enterprise WAN - IEEE Computer Applications in Power
##1 00500696.pdf ##2 INTEGRATION OF SCADA AND DA/DMS ACROSS A LARGE DISTRIBUTION SYSTEM - Energy Management and Power Delivery, 1995. Proceedings of EMPD '95., 1995 International Conferenc
##1 00515274.pdf ##2 THE DESIGN OF NEXT GENERATION SCADA SYSTEMS - Power Industry Computer Application Conference, 1995. Conference Proceedings., 1995 IEEE
##1 00517471.pdf ##2 THE ROLE OF MEDIUM ACCESS CONTROL PROTOCOLS IN SCADA SYSTEMS - Power Delivery, IEEE Transactions on
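
The script above targets Python 2 and the original pyPdf package. For readers on Python 3, the following is a minimal sketch of the same listing, assuming the third-party `pypdf` package (a successor to pyPdf that is not part of the original post) is installed. The helper names `pdf_files`, `read_title`, and `listing_line` are made up for illustration.

```python
import os


def pdf_files(directory="."):
    """Return the names of the PDF files in `directory`, sorted."""
    return sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith(".pdf")
    )


def read_title(path):
    """Return the metadata title of the PDF at `path`, or "" if unavailable.

    Assumption: the third-party `pypdf` package is installed.
    """
    from pypdf import PdfReader  # lazy import so the other helpers work without it

    try:
        info = PdfReader(path).metadata
        title = info.title if info is not None else None
    except Exception:
        # Damaged or encrypted file: report an empty title, like the original script
        return ""
    return title or ""


def listing_line(file_name, title):
    """Format one line in the same '##1 ... ##2 ...' style as the original output."""
    return "##1 {} ##2 {}".format(file_name, title).rstrip()


if __name__ == "__main__":
    for name in pdf_files("."):
        print(listing_line(name, read_title(name)))
```

Run it from the directory that holds the PDF files; as with the original, files without a stored title are listed with an empty title.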

18 comments:

virens said...

That's what I was searching for! Thank you very much, very useful script!

Endy said...

Thanks a lot, what a good script and library!

Cartman said...

Nice! But the problem is that many PDF files on the internet do not have title information inside them. I tried this on my collection but did not find even one PDF file with a "real title". Of course, the name of the DVI file from which the PDF was created was printed out, though!

daui said...

# modified script in order to rename files with titles
# pyPdf available at http://pybrary.net/pyPdf/

from pyPdf import PdfFileWriter, PdfFileReader
import os

for fileName in os.listdir('.'):
    try:
        if fileName.lower()[-3:] != "pdf": continue
        input1 = PdfFileReader(file(fileName, "rb"))

        # rename the document with its title
        os.rename(fileName, input1.getDocumentInfo().title + ".pdf")
    except:
        # report the file that could not be renamed; reading the title
        # again here could raise the same error a second time
        print '##1', fileName, '##2'

Anonymous said...

Hi, pal.

I am facing exactly the same problem as Cartman's

But it’s far from a neat solution, because many PDF creators don’t store the real title as metadata that can be seen in the native format or in the document properties.

Is there any better way ?

Dexter
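
For the problem Cartman and Dexter describe, the library can only return what the PDF's creator stored. A small, partial mitigation is to detect titles that are obviously leftover source file names (such as `paper.dvi`) and treat them as missing. The sketch below is only a heuristic, written for Python 3 with the standard library; the function name `looks_like_real_title` and the list of extensions are assumptions chosen for illustration, and it cannot recover a title that was never stored.

```python
import re

# Extensions that often appear when a tool copies the source file name
# into the title field (an illustrative list, not an exhaustive one).
SOURCE_EXTENSIONS = (".dvi", ".tex", ".doc", ".docx", ".ps", ".pdf", ".ppt")


def looks_like_real_title(title):
    """Guess whether a metadata title is a human-written title."""
    if not title:
        return False
    text = title.strip()
    if not text:
        return False
    # A title ending in a source-file extension is probably a file name
    if text.lower().endswith(SOURCE_EXTENSIONS):
        return False
    # Placeholders commonly left behind by authoring tools
    if re.match(r"^(untitled|microsoft word\b)", text, re.IGNORECASE):
        return False
    return True
```

Files that fail this check could be listed separately and titled by hand.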

Anonymous said...

Hi,
This looks so useful, but I can't make it work, really because I don't know anything about programming. Please help (detailed help).
I put the pdf files in the python folder
I run Python Shell
I copied from the Blog the script and pasted it in the Python window
Nothing happened.
I don't know, do I need to change some names in the script according to my data?
Please help.

SeaCat said...

In the first and second scripts, where do you give the folder path? I am trying to use this script. It runs, but where is the output? I am new to Python.

Ricardo N. Cabral said...

@SeaCat: you must run it from a command prompt:

c:\python25\python script.py

SeaCat said...

I am using the Python GUI, but the problem is that the script runs and no error or output appears. :( I have to extract titles and authors from research papers. :(

SeaCat said...

from pyPdf import PdfFileWriter, PdfFileReader
import os
for fileName in os.listdir('.'):
    try:
        if fileName.lower()[-3:] != "pdf":
            continue
        input1 = PdfFileReader(file(fileName, "rb"))
        # print the title of document1.pdf
        print '##1', fileName, '##2', input1.getDocumentInfo().title
    except:
        print '##1', fileName, '##2'

>>> from pyPdf import PdfFileWriter, PdfFileReader
import os
for fileName in os.listdir('.'):
    try:
        if fileName.lower()[-3:] != "pdf":
            continue
        input1 = PdfFileReader(file(C:\Documents and Settings\dcs3ma\Desktop\18-06-08 meeting\software architecture ACM, "rb"))
        # print the title of document1.pdf
        print '##1', fileName, '##2', input1.getDocumentInfo().title
    except:
        print '##1', fileName, '##2'


In this script, where do I put my folder path, and how will it write the output? To which file? Do I create the file, or is it created automatically? Where is the file created, and how can I get it?
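
On the two questions here: the scripts in this post read the current working directory (that is what `os.listdir('.')` does) and print to the screen, so no output file is created. A minimal sketch of a variant that takes the folder and an output file explicitly might look like this. It is written for Python 3; `write_listing` and its parameters are hypothetical names, and `get_title` stands for whatever title-extraction function you use (for example, one built on pyPdf).

```python
import os


def write_listing(folder, out_path, get_title):
    """Write one '##1 <file> ##2 <title>' line per PDF in `folder` to `out_path`.

    `get_title` is any callable that takes a file path and returns a title.
    Returns the number of lines written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for name in sorted(os.listdir(folder)):
            if not name.lower().endswith(".pdf"):
                continue
            try:
                title = get_title(os.path.join(folder, name)) or ""
            except Exception:
                # Mirror the original script: list the file with an empty title
                title = ""
            out.write("##1 {} ##2 {}\n".format(name, title))
            count += 1
    return count
```

For example, `write_listing(r"C:\papers", "titles.txt", my_title_function)` (a made-up folder and function name) would create `titles.txt` in the directory the script is run from. Note that the folder path has to be written as a quoted string.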

Anonymous said...

I modified the script a little bit to handle some errors:
# -*- coding: cp1252 -*-
# modified script in order to rename files with titles
# pyPdf available at http://pybrary.net/pyPdf/

from pyPdf import PdfFileWriter, PdfFileReader
import os

trgtfilename = ""

for fileName in os.listdir('.'):
    if fileName.lower()[-3:] != "pdf": continue

    actfile = file(fileName, "rb")
    input1 = PdfFileReader(actfile)
    try:
        trgtfilename = input1.getDocumentInfo().title + "_" + fileName
    except:
        print "\n## ERROR ## %s Title could not be extracted. PDF file may be encrypted!" % fileName
        continue
    del input1
    actfile.close()
    print 'Trying to rename from:', fileName, '\n to ', trgtfilename
    try:
        os.rename(fileName,trgtfilename)
    except:
        print fileName, ' could not be renamed!'
        print '\n## ERROR ## Maybe the filename already exists or the document is already opened!'

I hope you find it useful
Alex

Anonymous said...

This blog site kills indentation and therefore breaks those Python scripts. Make sure you indent correctly after copy and paste. To run the script, type:

python my-pasted-text.py

in a shell / command window. With correct indentation it works nicely.

Anonymous said...

http://pastebin.com/Vk1UTn2p

python code with proper indentation

Anonymous said...

http://pastebin.com/CDmgbkPG

Properly indented; otherwise the Python code does not work.

this one tries to keep unique names.
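
The pastebin code is not reproduced here. As an independent illustration of the idea, the following minimal sketch (Python 3, standard library only) turns a title into a safe file name and appends a counter when the name is already taken. The names `safe_name` and `unique_target` are invented for this example and are not taken from the linked code.

```python
import os
import re


def safe_name(title, max_length=100):
    """Replace characters that are invalid in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", title)
    # Collapse runs of whitespace and trim leading/trailing spaces and dots
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    return cleaned[:max_length] or "untitled"


def unique_target(directory, title, extension=".pdf"):
    """Return a path in `directory` that does not exist yet."""
    base = safe_name(title)
    candidate = os.path.join(directory, base + extension)
    counter = 1
    while os.path.exists(candidate):
        # Name already taken: try Title_1.pdf, Title_2.pdf, ...
        candidate = os.path.join(
            directory, "{}_{}{}".format(base, counter, extension)
        )
        counter += 1
    return candidate
```

The result of `unique_target` can be passed as the second argument to `os.rename`, so that two PDFs with the same title do not overwrite each other.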

Anonymous said...

This script is quite useful and fast. When PhotoRec gives you only unique file names, it is quick to use mc (Midnight Commander) and this script to get your recovered drive back into some sort of shape.

SHARATH REDDY said...

Does anyone know how to detect all the headings used in a PDF document?

Docear said...

If a Java tool is fine as well, you can also use our tool "Docear's PDF Inspector": www.docear.org/software/add-ons/docears-pdf-inspector/

Anonymous said...

Similar but with PyPDF2

http://codrspace.com/still-time/rename-pdf-files-with-extracted-title/