It's probably time to document the database of test PDFs and parsing results that I've had online for quite some time. This entry attempts to do that for the current state of play.
Adding a PDF to the collection
You'll need the pdfcollection code (tarball). This is all the code I used to manage the database.
Run the addpdf script to add a new PDF to the collection. Here's an example:
mikal@lapel:~/opensource/pdfcollection$ ./addpdf /home/mikal/pdfa/This\ is\ a\ sample\ PDF.pdf
Where is the PDF database? /home/pdfdb
Publish? (y/n) y
Adding /home/mikal/pdfa/This is a sample PDF.pdf
New object id is 649 (000649)
Number of pages: 0
In the transcript above I typed the command itself and the answers to the two prompts (/home/pdfdb and y). The local copy of the PDF database on my machine is in /home/pdfdb/. This command relies on ghostscript, imagemagick, ghostview, and pdfinfo (from the xpdf utilities package) being installed, so make that happen. It displays the PDF with gv, makes sure you really want to add it (i.e. you own enough of the rights to the document to do so), and then does its thing.
In this example, ghostscript failed to extract any pages from the document, which is a little sad.
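If you want to check the dependencies mentioned above before running addpdf, something like the following rough sketch might help. The missing_tools function is my own illustration and is not part of the pdfcollection code, and the binary names (gs, convert, gv, pdfinfo) are assumptions that can differ between distributions.

```shell
#!/bin/sh
# Print the name of every listed program that is not found on the PATH.
# (missing_tools is a hypothetical helper, not part of pdfcollection.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed binary names: gs = ghostscript, convert = imagemagick,
# gv = ghostview, pdfinfo = xpdf utilities. Adjust for your system.
missing=$(missing_tools gs convert gv pdfinfo)
if [ -n "$missing" ]; then
    echo "Install these before running addpdf:" $missing
else
    echo "All addpdf dependencies found"
fi
```

This only reports what is missing and never exits with an error, so it is safe to paste into an interactive shell.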
But I don't want to view the document before adding it
Then use the --force flag to addpdf and all will go well. Use a command line like this:
./addpdf doc.pdf --force
Recreating the page count and thumbnails for existing documents
A lot of the PDFs have been in the database for several years, and I'd hope that ghostscript's ability to read PDF documents has improved in that time. You can easily regenerate the page count, thumbnails and metadata for a PDF document with the processpdf command. This is the command that addpdf uses under the hood. Let's give it a go:
mikal@lapel:~/opensource/pdfcollection$ for item in `ls /home/pdfdb/ | grep 0`
> do
>   ./processpdf /home/pdfdb $item
> done
This simple script regenerated all of the metadata for all of the PDFs in the database, and hammered my machine while doing it. The command line arguments are the location of the PDF database, and the id number of the PDF to process.
This command has basically the same dependencies as addpdf.
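The ls and grep in the loop above will pick up anything with a 0 in its name. A slightly stricter variant is sketched below. It assumes that each entry in the database directory is named with the six-digit, zero-padded object id (like the 000649 shown earlier), which I haven't spelled out in this post, and list_pdf_ids is a made-up helper name rather than part of pdfcollection.

```shell
#!/bin/sh
# List database entries whose names look like six-digit object ids.
# Assumption: entries are named like 000649; list_pdf_ids is hypothetical.
list_pdf_ids() {
    db=$1
    for path in "$db"/[0-9][0-9][0-9][0-9][0-9][0-9]; do
        [ -e "$path" ] || continue   # the glob matched nothing
        basename "$path"
    done
}

# Possible usage, run from the pdfcollection directory:
#   for id in $(list_pdf_ids /home/pdfdb); do
#       ./processpdf /home/pdfdb "$id"
#   done
```

The glob keeps stray files out of the loop, at the cost of hard-coding the id width.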
I've run out of things to say for now, but later I'll show you how to rerun the pdfomatic