Time to document my PDF testing database

    I have had a database of test PDFs and various parsing results online for quite some time, and it is probably time to document it. This entry is an attempt to do that for the current state of play.

    Adding a PDF to the collection

    You'll need the pdfcollection code (tarball). This is all the code I use to manage the database.

    Run the addpdf script to add a new PDF to the collection. Here's an example:

    mikal@lapel:~/opensource/pdfcollection$ ./addpdf /home/mikal/pdfa/This\ is\ a\ sample\ PDF.pdf
    Where is the PDF database? /home/pdfdb
    Publish? (y/n) y
    Adding /home/mikal/pdfa/This is a sample PDF.pdf
    New object id is 649 (000649)
    Moving PDF
    Processing PDF
    Extracting pages
    Number of pages: 0
    Extract info

    In the transcript above I typed the addpdf command line and the answers to the two prompts (/home/pdfdb and y); everything else is output. The local copy of the PDF database on my machine is in /home/pdfdb/. This command relies on ghostscript, imagemagick, ghostview, and pdfinfo (from the xpdf utilities package) being installed, so make that happen first. It displays the PDF with gv, makes sure you really want to add it (i.e. you own enough of the rights to the document to do so), and then does its thing.
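
    Since addpdf depends on several external tools, a quick check before running it can save some confusion. This is only a sketch, not part of the pdfcollection code: missing_tools is a made-up helper, and the binary names in the usage comment (gs for ghostscript, convert for imagemagick, gv for ghostview) are assumptions about how those packages are usually installed.

```shell
#!/bin/sh
# Hypothetical helper, not part of pdfcollection.
# Prints the name of every listed command that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage (binary names are assumptions, adjust for your system):
#   missing_tools gs convert gv pdfinfo
# No output means everything was found.
```

    Anything it prints is a tool you probably need to install before addpdf will work.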

    In this example, ghostscript failed to extract any pages from the document, which is a little sad.
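
    When ghostscript reports zero pages, it can be useful to ask pdfinfo what it thinks, since pdfinfo prints a "Pages:" line in its output. The following is a sketch of that cross-check and is not something addpdf does as far as this post describes; page_count_from_pdfinfo is a hypothetical helper that reads pdfinfo output on stdin.

```shell
#!/bin/sh
# Hypothetical helper, not part of pdfcollection.
# Reads pdfinfo output on stdin and prints the value of the first "Pages:" line.
page_count_from_pdfinfo() {
    awk '/^Pages:/ { print $2; exit }'
}

# Usage:
#   pdfinfo "/home/mikal/pdfa/This is a sample PDF.pdf" | page_count_from_pdfinfo
```

    If pdfinfo reports a non-zero count for a document where ghostscript extracted nothing, the problem is more likely in the rendering step than in the document itself.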

    But I don't want to view the document before adding it

    Then use the --force flag to addpdf and all will go well. Use a command line like this:

    ./addpdf doc.pdf --force

    Recreating the page count and thumbnails for existing documents

    A lot of the PDFs have been in the database for several years, and I assume that ghostscript's ability to render PDF documents has improved in that time. You can easily regenerate the page count, thumbnails and metadata for a PDF document with the processpdf command. This is the command that addpdf uses under the hood. Let's give it a go:

    mikal@lapel:~/opensource/pdfcollection$ for item in `ls /home/pdfdb/ | grep 0`          
    > do
    >   ./processpdf /home/pdfdb $item 
    > done

    This simple script regenerated all of the metadata for all of the PDFs in the database, and hammered my machine while doing it. The command line arguments are the location of the PDF database, and the id number of the PDF to process.
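
    The ls and grep pipeline above works, but it will also pick up any other file whose name happens to contain a zero. A slightly more careful variant is sketched below. It assumes that every PDF lives in an entry named by its six-digit, zero-padded id (as in the 000649 shown earlier); that layout is my guess from the addpdf output rather than something spelled out here, and list_pdf_ids is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper, not part of pdfcollection.
# Prints the names of entries directly under the database directory
# that consist of exactly six digits (assumed id format).
list_pdf_ids() {
    dbdir=$1
    for path in "$dbdir"/[0-9][0-9][0-9][0-9][0-9][0-9]; do
        [ -e "$path" ] || continue   # the glob matched nothing
        basename "$path"
    done
}

# Usage:
#   list_pdf_ids /home/pdfdb | while read -r id; do
#       ./processpdf /home/pdfdb "$id"
#   done
```

    The glob avoids parsing ls output, and quoting "$id" keeps the loop safe if an odd name ever sneaks in.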

    This command has basically the same dependencies as the addpdf command.


    I've run out of things to say for now, but later I'll show you how to rerun the pdfomatic regression tests.

    Tags for this post: pdfdb pdf database test document

posted at: 22:21 | path: /pdfdb