It's probably time to document my database of test PDFs and parsing results, which has been online for quite some time. This entry describes the current state of play.
Adding a PDF to the collection
You'll need the pdfcollection code (tarball). This is all the code I used to manage the database.
Run the addpdf script to add a new PDF to the collection. Here's an example:
mikal@lapel:~/opensource/pdfcollection$ ./addpdf /home/mikal/pdfa/This\ is\ a\ sample\ PDF.pdf
Where is the PDF database? /home/pdfdb
Publish? (y/n) y
Adding /home/mikal/pdfa/This is a sample PDF.pdf
New object id is 649 (000649)
Moving PDF
Processing PDF
Extracting pages
Number of pages: 0
Extract info
mikal@lapel:~/opensource/pdfcollection$
What I typed above is the addpdf command line and the answers to the two prompts (/home/pdfdb and y). The local copy of the PDF database on my machine is in /home/pdfdb/. This command relies on ghostscript, imagemagick, ghostview, and pdfinfo (from the xpdf utilities package), so install those first. It displays the PDF with gv, makes sure you really want to add it (i.e. you own enough of the rights to the document to do so), and then does its thing.
In this example, ghostscript failed to extract any pages from the document, which is a little sad.
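Since addpdf leans on several external tools, it can be worth checking that they are on your PATH before running it. The following is a minimal sketch, not part of the pdfcollection code. The function name missing_tools is made up for illustration, and the binary names (gs, convert, gv, pdfinfo) are assumptions based on how these packages are usually installed.

```shell
#!/bin/sh
# Print the name of each requested tool that cannot be found on PATH.
# missing_tools is a hypothetical helper, not part of pdfcollection.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed binary names: gs (ghostscript), convert (imagemagick),
# gv (ghostview), pdfinfo (xpdf utilities).
missing=$(missing_tools gs convert gv pdfinfo)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing
else
    echo "All dependencies found"
fi
```

If anything is reported missing, install it with your distribution's package manager before trying addpdf again.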
But I don't want to view the document before adding it
Then use the --force flag to addpdf and all will go well. Use a command line like this:
./addpdf doc.pdf --force
Recreating the page count and thumbnails for existing documents
A lot of the PDFs have been in the database for several years, and in that time ghostscript's ability to handle PDF documents has hopefully improved. You can easily regenerate the page count, thumbnails and metadata for a PDF document with the processpdf command. This is the command that addpdf uses under the hood. Let's give it a go:
mikal@lapel:~/opensource/pdfcollection$ for item in `ls /home/pdfdb/ | grep 0`
> do
> ./processpdf /home/pdfdb $item
> done
This simple loop regenerated the metadata for every PDF in the database, and hammered my machine while doing it. The command line arguments to processpdf are the location of the PDF database and the id number of the PDF to process.
This command has basically the same dependencies as the addpdf command.
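The loop above selects entries with grep 0, which matches any name containing a zero. A slightly stricter variant could match only names that are exactly six digits. This is a sketch under the assumption that database entries are named by their zero-padded id (as in 000649 from the addpdf output). The function names list_pdf_ids and reprocess_all are hypothetical.

```shell
#!/bin/sh
# List entries in a database directory whose names are exactly six digits.
# Assumption: entries are named by zero-padded id, for example 000649.
list_pdf_ids() {
    for path in "$1"/[0-9][0-9][0-9][0-9][0-9][0-9]; do
        [ -e "$path" ] || continue   # skip the literal pattern when nothing matches
        basename "$path"
    done
}

# Run processpdf on every id found, reporting failures without stopping.
# Assumes processpdf lives in the current directory, as in the example above.
reprocess_all() {
    db=$1
    for id in $(list_pdf_ids "$db"); do
        ./processpdf "$db" "$id" || echo "processpdf failed for $id" >&2
    done
}
```

With those two functions defined, reprocess_all /home/pdfdb would do roughly what the loop above does, while ignoring stray files that merely happen to contain a zero in their name.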
Conclusion
I've run out of things to say for now, but later I'll show you how to rerun the pdfomatic regression tests.
Tags for this post: pdfdb