This is the quickest of blog postings outlining how I am initially providing a text mining interface to digitized Catholic pamphlets.
Jean McManus used a scanner to create PDF versions of a few Catholic pamphlets. Along the way, she also had the software do a bit of OCR. She then gave the PDF documents to me with filenames matching MARC 001 fields.
I saved these files to our local file system and used the venerable pdftotext application to extract the plain text. I then hacked my locally harvested MARC records describing the given pamphlets, adding two URLs to each: one pointing to the local PDF file, and another pointing to a rudimentary text mining interface. Finally, I reindexed the MARC records, making the URLs visible. There were only three edited records, and you can see the fruits of these labors here:
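For the curious, the extraction and linking steps described above might be sketched in Python roughly as follows. This is not the code I actually ran; it is a minimal illustration that assumes pdftotext is on the PATH and that each PDF is named after its MARC 001 value. The function names, the base URL, and the URL patterns are all hypothetical.

```python
import subprocess
from pathlib import Path


def marc_id_from_pdf(pdf_path):
    """Return the MARC 001 value, assumed to be the file name minus '.pdf'."""
    return Path(pdf_path).stem


def pdftotext_command(pdf_path, txt_path):
    """Build the argument list for: pdftotext <PDF-file> <text-file>"""
    return ["pdftotext", str(pdf_path), str(txt_path)]


def extract_text(pdf_path, out_dir):
    """Run pdftotext on one PDF and return the path of the resulting text file."""
    txt_path = Path(out_dir) / (marc_id_from_pdf(pdf_path) + ".txt")
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path


def urls_for(marc_id, base_url):
    """Return the two URLs to add to a record (hypothetical URL patterns)."""
    base = base_url.rstrip("/")
    return {
        "pdf": f"{base}/pdf/{marc_id}.pdf",
        "text_mining": f"{base}/mine/?id={marc_id}",
    }
```

Calling extract_text() on each PDF in a directory and then urls_for() on each identifier would produce the plain text files and the pair of links to be written into the corresponding MARC record.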
There are many things wrong with the implementation. The text mining interface points to invalid catalog records because its links are hard-coded for University of Toronto content. The titles of the content include MARC field 245$c, but the older text mining interface did not expect this. Consequently, the title information for these newly added records is invalid. The PDF documents were scanned two pages at a time. This probably causes each line of extracted text to run across both pages, interleaving the two and thus invalidating every sentence. We will need to scan only one page per image to circumvent this problem.
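The title problem, at least, looks tractable. In conventionally punctuated MARC records the 245$c statement of responsibility follows the title proper after a space and a slash, so one possible fix is to trim everything from that slash onward before handing the title to the text mining interface. The sketch below assumes that punctuation is present in the records; the function name and the sample titles are made up for illustration.

```python
import re

# A slash preceded by whitespace, plus anything after it, at the end of the title.
_RESPONSIBILITY = re.compile(r"\s+/(\s+.*)?$")


def strip_statement_of_responsibility(title):
    """Drop a trailing ' / ...' statement of responsibility from a title string."""
    return _RESPONSIBILITY.sub("", title).strip()


# Hypothetical titles, not taken from the actual pamphlets
print(strip_statement_of_responsibility("A sample pamphlet / by A. N. Author."))
print(strip_statement_of_responsibility("A title with no statement"))
```

A slash with no space before it, as in "Church/State", is left alone, since only the space-slash pattern marks the start of the statement of responsibility.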
Despite these difficulties, it is now possible to do a bit of analysis against the pamphlets, but there are many avenues for improvement. “Software is never done.”