This posting outlines a possible workflow for getting digitized versions of Notre Dame’s Catholic pamphlets into the “Catholic Portal”.
The University of Notre Dame owns a significant number of Catholic pamphlets. These materials have been cataloged and flagged as destined for the “Portal” by adding the letters “CRRA” to field 590$u of their MARC records.
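Selecting the flagged records amounts to looking for “CRRA” in 590$u. Here is a minimal sketch, assuming a simplified stand-in record structure (a dict mapping a MARC tag to a list of subfield code/value pairs) rather than the actual cataloging system:

```python
# Hypothetical, simplified record structure: each record is a dict
# mapping a MARC tag (e.g. "590") to a list of (subfield code, value) pairs.
def destined_for_portal(record):
    """Return True if any 590 field has a $u subfield containing 'CRRA'."""
    for code, value in record.get("590", []):
        if code == "u" and "CRRA" in value:
            return True
    return False

sample = {
    "245": [("a", "A Catholic pamphlet")],
    "590": [("u", "CRRA")],
}
print(destined_for_portal(sample))  # prints True
```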
The University’s library wants to digitize these materials, make the resulting PDF files freely available on the Web, apply optical character recognition against the PDF files, and support a text mining interface against the result. Bits and pieces of this work have already been done. The problem is gluing them together into a functional workflow.
This is the quickest of blog postings outlining how I am initially providing a text mining interface to digitized Catholic pamphlets.
Jean McManus used a scanner to create PDF versions of a few Catholic pamphlets. Along the way, she also had the software do a bit of OCR. She then gave the PDF documents to me with filenames matching their MARC 001 fields.
I saved these files to our local file system and used the venerable pdftotext application to extract the plain text. I then hacked my locally harvested MARC records describing the given pamphlets with two additional URLs: one pointing to the local PDF file, and another pointing to a rudimentary text mining interface. Finally, I reindexed the MARC records, making the URLs visible. There were only three edited records, and you can see the fruits of these labors here:
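The per-record steps above can be sketched as follows. This is a sketch only: the simplified record structure, the base URLs, and the function name are my assumptions, not the actual implementation, and the reindexing step is omitted:

```python
# Hypothetical base URLs; the real locations are not given in the posting.
PDF_BASE = "http://example.nd.edu/pamphlets"   # assumed home of the PDF files
MINER_BASE = "http://example.nd.edu/mine"      # assumed text mining interface

def add_portal_urls(record, identifier):
    """Append two 856 $u subfields to a record: one pointing to the
    local PDF file, another to the text mining interface.

    `record` is a simplified stand-in for a MARC record (a dict mapping
    a tag to a list of subfield code/value pairs), and `identifier` is
    the MARC 001 value, which also names the PDF file.
    """
    urls = record.setdefault("856", [])
    urls.append(("u", f"{PDF_BASE}/{identifier}.pdf"))
    urls.append(("u", f"{MINER_BASE}?id={identifier}"))
    return record

# The text extraction itself happens on the command line, along the lines of:
#   pdftotext 001234567.pdf 001234567.txt

record = {"245": [("a", "A sample pamphlet")]}
add_portal_urls(record, "001234567")
print(record["856"])
```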
There are many things wrong with this implementation. The text mining interface points to invalid catalog records because its links are hard-coded for University of Toronto content. The titles of the content include MARC field 245$c, which the older text mining interface did not expect; consequently, the title information for these newly added records is invalid. Finally, the PDF documents were scanned two pages at a time, which probably causes the extracted text to span both pages and thus invalidate every sentence. We will need to scan only one page per image to circumvent this problem.
Despite these difficulties, it is now possible to do a bit of analysis against the pamphlets, but there are many avenues for improvement. “Software is never done.”
This posting documents how I wrote and edited a couple of VUFind record drivers and Smarty templates for the “Portal” of the Catholic Research Resources Alliance. In writing this posting I hope to support any developer coming behind me as well as inform the wider open source community on how VUFind works.