Without undue difficulty I have been able to harvest metadata from a ContentDM site via OAI-PMH, index the data in Solr, and successfully search & retrieve this metadata in VuFind all for the “Catholic Portal”. This posting outlines how I did this and why it is important.
The content of the “Portal” is expected to be rare, infrequently held, and uncommon. More often than not, this type of material is held in library special collections and archives. Increasingly, this same material is digitized and stored in some sort of digital repository. Any repository worth its salt supports some sort of API (application programming interface) allowing computer programs to harvest and use the underlying metadata. ContentDM is one such repository application, and OAI-PMH (Open Archives Initiative – Protocol for Metadata Harvesting) is one such API.
Duquesne University recently became a member of the CRRA (Catholic Research Resources Alliance), and it is my job to make their metadata stored in ContentDM a part of the Portal. The balance of this posting describes how I did that.
The latest and greatest version of VuFind comes with a PHP-based utility to harvest content from OAI-PMH data providers. Using it is simple enough. Edit a configuration file. Run a program. Metadata (XML files) appear in a local directory. The utility is smart enough to keep track of harvest dates so OAI-PMH deletes and updates can be managed easily. For more detail see the section on OAI harvesting on the VuFind wiki.
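The harvester itself is VuFind's PHP utility, but the requests it makes are plain OAI-PMH. As an illustrative sketch (in Python rather than PHP, and with a made-up base URL), a ListRecords harvest boils down to building URLs like these and following resumption tokens until none remain:

```python
from urllib.parse import urlencode

def oai_request_url(base_url, resumption_token=None):
    """Build a ListRecords request against an OAI-PMH data provider.

    Per the protocol, a resumptionToken request carries only the verb
    and the token; the first request names the metadata format instead.
    A harvester loops, re-requesting with each returned token, until
    the response contains no resumptionToken.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)
```

For example, `oai_request_url("http://example.org/oai")` yields the initial Dublin Core request, and passing the token from each response yields the follow-up pages.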
VuFind comes with a second PHP-based utility to index the harvested metadata. Using it requires the developer to write XSLT files, edit another configuration file, and run the program. But since my PHP skills are not nearly as strong as my Perl skills, and since I had previously indexed other XML files in a different manner, I decided not to use the PHP indexer.
My implementation is based on my previously written EAD indexing routines, which are described in “Indexing MARC and EAD in VUFind with Solr for the CRRA”. In a nutshell, the script:
- reads each harvested metadata file
- maps the Dublin Core metadata to VuFind/Solr schema fields
- feeds the metadata to Solr
More specifically, I mapped the following Dublin Core elements to Solr schema fields like this:
- contributor -> author2
- creator -> author, author_letter
- date -> publishDate
- description -> description
- format -> format
- language -> language
- publisher -> publisher
- subject -> topic
- title -> title, title_auth, title_full, title_fullStr, title_full_unstemmed, title_short, title_sort
- type -> type
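The mapping above is a straight element-to-field copy. A minimal sketch of it in Python (the actual script is Perl; the record shape here, a dictionary of element names to value lists, is my illustrative assumption):

```python
# Dublin Core element -> VuFind/Solr schema fields, as listed in the post.
DC_TO_SOLR = {
    "contributor": ["author2"],
    "creator": ["author", "author_letter"],
    "date": ["publishDate"],
    "description": ["description"],
    "format": ["format"],
    "language": ["language"],
    "publisher": ["publisher"],
    "subject": ["topic"],
    "title": ["title", "title_auth", "title_full", "title_fullStr",
              "title_full_unstemmed", "title_short", "title_sort"],
    "type": ["type"],
}

def map_record(dc_record):
    """Copy each Dublin Core value into every Solr field it maps to."""
    doc = {}
    for element, values in dc_record.items():
        for solr_field in DC_TO_SOLR.get(element, []):
            doc.setdefault(solr_field, []).extend(values)
    return doc
```

A single creator value thus lands in both `author` and `author_letter`, and a title fans out to all seven title fields.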
I populated additional Solr schema fields in different ways. Allfields is a concatenation of all the Dublin Core metadata elements. Fullrecord is a tiny XML file of my own design, similar to the one I created in the EAD implementation. Institution and building are presently hard-coded into the script but will later be pulled from a database containing all CRRA members. RecordType was filled with “oaidc”. Finally, the location of the remote digital object (dc:identifier) is inserted into the URL element of the fullrecord field.
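Since the fullrecord format is of my own design, the exact shape is not important; what matters is that dc:identifier ends up in a URL element inside it. A hypothetical sketch (the element names here are illustrative, not the real format):

```python
import xml.etree.ElementTree as ET

def make_fullrecord(identifier, title):
    """Build a tiny XML wrapper carrying the remote object's location.

    The real fullrecord format is the author's own design; this shape
    (a <record> with <title> and <url> children) is a made-up stand-in.
    """
    root = ET.Element("record")
    ET.SubElement(root, "title").text = title
    # dc:identifier -> the URL element the record driver will display
    ET.SubElement(root, "url").text = identifier
    return ET.tostring(root, encoding="unicode")
```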
Once the mapping is done, a Perl WebService::Solr document object is created, filled with the metadata, and posted to Solr. The script is called oai_dc-index.pl and is available for your perusal.
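The posting step itself is a simple HTTP request to Solr's update handler. The script does this with WebService::Solr; as a language-neutral sketch, the same thing in Python against Solr's JSON update endpoint looks roughly like this (the update URL is an assumption for your local Solr):

```python
import json
from urllib import request

def solr_add_payload(doc):
    """Solr's JSON update format: a list of documents to add."""
    return json.dumps([doc]).encode("utf-8")

def post_to_solr(solr_update_url, doc):
    """POST one document to a running Solr instance and commit.

    Network call: requires Solr listening at solr_update_url,
    e.g. http://localhost:8983/solr/biblio/update (hypothetical).
    """
    req = request.Request(
        solr_update_url + "?commit=true",
        data=solr_add_payload(doc),
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)
```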
The final step was to write a VuFind record driver for the new record type — oaidc. The coding for this was trivial since much of the work had been done for the EAD files. I copied EadRecord.php to OaidcRecord.php, and changed the names of a couple of classes. These minor tweaks enable me to display Duquesne’s name, library, and URLs in VuFind.
The end result is a set of five additional records in the Portal, all pointing to and providing access to digitized content from Duquesne University’s Gumberg Library.
The implementation is not perfect.
First of all, each of the five digitized items in Duquesne’s ContentDM implementation is a book. All of the pages in the books are accessible individually, and each page has metadata associated with it. Unfortunately, that metadata is meager. Consequently I needed to delete hundreds of metadata records from the OAI-PMH harvest and retain only the book-level metadata.
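The post does not say exactly how book-level records were distinguished from page-level ones; one plausible heuristic, sketched here purely as an illustration, is to keep only records that populate more than a handful of Dublin Core elements, since the page-level records are meager:

```python
def is_book_level(dc_record, min_elements=4):
    """Hypothetical filter: treat records with few populated Dublin Core
    elements as page-level and drop them, keeping the richer book-level
    records. The threshold of 4 is an arbitrary illustrative choice."""
    populated = [e for e, values in dc_record.items() if any(values)]
    return len(populated) >= min_elements
```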
Second, the script currently includes a number of hard-coded characteristics, but when other OAI-PMH data repositories become available these hard-coded characteristics will be generalized.
Why is this important? There are a number of reasons. I believe a few of our CRRA members have ContentDM implementations. Harvesting and indexing their metadata will not only make the Portal richer, but it will also make it easier for students, teachers, and researchers to access the full text of the materials online.