Time to document my PDF testing database
I've had a database of test PDFs and various parsing results online for quite some time, and it seems to me that it's time to document it. This entry is an attempt to do that for the current state of play.
Adding a PDF to the collection
You'll need the pdfcollection code (tarball). This is all the code I used to manage the database.
Run the addpdf script to add a new PDF to the collection. Here's an example:
mikal@lapel:~/opensource/pdfcollection$ ./addpdf /home/mikal/pdfa/This\ is\ a\ sample\ PDF.pdf
Where is the PDF database? /home/pdfdb
Publish? (y/n) y
Adding /home/mikal/pdfa/This is a sample PDF.pdf
New object id is 649 (000649)
Moving PDF
Processing PDF
Extracting pages
Number of pages: 0
Extract info
mikal@lapel:~/opensource/pdfcollection$
In the session above I typed the addpdf command line, the database location (/home/pdfdb) and the y at the Publish? prompt; everything else is output. The local copy of the PDF database on my machine is in /home/pdfdb/. This command relies on ghostscript, imagemagick, ghostview, and pdfinfo (from the xpdf utilities package) being installed, so make that happen. It displays the PDF with gv, makes sure you really want to add it (i.e. you own enough of the rights to the document to do so), and then does its thing.
In this example, ghostscript failed to extract any pages from the document, which is a little sad.
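If you want a second opinion on a page count of zero, pdfinfo reports a page count of its own. Here is a minimal sketch, assuming pdfinfo's usual "Pages:" output line; the page_count helper is a name I made up for illustration and is not part of pdfcollection:

```shell
# Read pdfinfo output on stdin and print just the page count.
# Assumes a line of the form "Pages:          3", which is how
# pdfinfo normally reports it.
page_count() {
    awk '/^Pages:/ { print $2 }'
}

# Usage, with pdfinfo installed and doc.pdf standing in for a real file:
#   pdfinfo doc.pdf | page_count
```

If pdfinfo reports a sensible number while addpdf says 0, that suggests the problem is on the ghostscript side rather than in the document.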
But I don't want to view the document before adding it
Then use the --force flag to addpdf and all will go well. Use a command line like this:
./addpdf doc.pdf --force
Recreating the page count and thumbnails for existing documents
A lot of the PDFs have been in the database for several years, and I assume that ghostscript's ability to view PDF documents has improved in that time. You can therefore easily regenerate the page count, thumbnails and metadata for a PDF document with the processpdf command. This is the command that addpdf actually uses under the hood. Let's give it a go:
mikal@lapel:~/opensource/pdfcollection$ for item in `ls /home/pdfdb/ | grep 0`
> do
> ./processpdf /home/pdfdb $item
> done
This simple script regenerated all of the metadata for all of the PDFs in the database, and hammered my machine while doing it. The command line arguments are the location of the PDF database, and the id number of the PDF to process.
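The ls and grep 0 filter picks up anything in the database directory with a zero in its name. A slightly stricter variant might look like the sketch below. It assumes every entry in the database is named by its six-digit, zero-padded id (as in 000649 above), which I haven't spelled out anywhere, and the function names list_pdf_ids and reprocess_all are hypothetical. Running processpdf under nice is my own addition to soften the hammering.

```shell
# Print the ids found in a pdfdb directory, one per line.
# Assumption: each entry is named by its six-digit id, e.g. 000649.
list_pdf_ids() {
    for path in "$1"/[0-9][0-9][0-9][0-9][0-9][0-9]; do
        [ -e "$path" ] || continue   # the glob matched nothing
        basename "$path"
    done
}

# Rerun processpdf for every id in the database.
# nice lowers the priority so the machine stays usable.
reprocess_all() {
    list_pdf_ids "$1" | while read -r id; do
        nice ./processpdf "$1" "$id"
    done
}
```

Run from the pdfcollection directory, reprocess_all /home/pdfdb would then do the same job as the loop above.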
This command has basically the same dependencies as the addpdf command.
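Since both commands lean on the same external tools, it can save some head scratching to check they are installed first. This is a rough sketch rather than anything shipped with pdfcollection. The binary names are assumptions (gs for ghostscript, convert for imagemagick, gv for ghostview, pdfinfo from the xpdf utilities), and missing_tools is a made-up helper name.

```shell
# Print the name of every given command that is not on PATH,
# one per line. Prints nothing if they are all present.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Assumed binary names; adjust them for your distribution.
missing=$(missing_tools gs convert gv pdfinfo)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All tools found"
fi
```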
I've run out of things to say for now, but later I'll show you how to rerun the pdfomatic regression tests.