The primary purpose of this posting is to document some of my experiences with OAI and VuFind. Specifically, it outlines a sort of “recipe” I use to import OAI content into the “Catholic Portal”. The recipe includes a set of “ingredients”: site-specific commands. Towards the end, I ruminate on the use of OAI and Dublin Core for the sharing of metadata.
When I learn of a new OAI repository containing metadata destined for the Portal, I use the following recipe to complete the harvesting/indexing process:
- Use the OAI protocol directly to browse the remote data repository – This requires a somewhat in-depth understanding of how OAI-PMH functions, and describing it in any additional detail is beyond the scope of this posting. Please consider perusing the OAI specification itself.
- Create a list of sets to harvest – This is like making a roux and is used to configure the oai.ini file, next.
- Edit/configure harvesting via oai.ini and properties files – The VuFind oai.ini file denotes the repositories to harvest from as well as some pretty cool configuration directives governing the harvesting process. Whoever wrote the harvester for VuFind did a very good job. Kudos!
- Harvest a set – The command for this step is in the list of ingredients, below. Again, this is very well written.
- Edit/configure indexing via an XSL file – This is the most difficult part of the process. It requires me to write XSL, which is not too difficult in and of itself, but since each set of OAI content is often different from every other set, the XSL is set-specific. Moreover, the metadata of the set is often incomplete, inconsistent, or ambiguous, making the indexing process a challenge. In another post, it would behoove me to include a list of XSL routines I seem to use from repository to repository, but again, each repository is different.
- Test XSL output for completeness – The command for this step is below.
- Go to Step #5 until done – In this case “done” is usually defined as “good enough”.
- Index set – Our raison d’être, and the command is given below.
- Go to Step #4 for all sets – Each repository may include many sets, which is a cool OAI feature.
- Harvest and index all sets – Enhance the Portal.
- Go to Step #10 on a regular basis – OAI content is expected to evolve over time.
- Go to Step #1 on a less regular basis – Not only does content change, but the way it is described evolves as well. Harvesting and indexing is a never-ending process.
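The browsing in the first step of the recipe boils down to a handful of HTTP requests. The sketch below composes them by hand; the BASE endpoint and the set name are hypothetical, and any OAI-PMH repository will do:

```shell
# Build the OAI-PMH request URLs used to browse a repository by hand.
# BASE is a hypothetical endpoint; substitute the repository's real one.
BASE='http://example.org/oai/request'

oai_request() {                     # compose a request URL for a given verb
  verb=$1; extra=$2
  echo "${BASE}?verb=${verb}${extra:+&${extra}}"
}

oai_request Identify                                    # who are you?
oai_request ListSets                                    # what sets exist?
oai_request ListRecords 'metadataPrefix=oai_dc&set=x'   # sample a set's records

# Hand any of these to curl, e.g.: curl -s "$(oai_request ListSets)"
```

The output of ListSets is what feeds the roux-making of Step #2.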
I use the following Linux “ingredients” to help me through the process of harvesting and indexing. I initialize things with a couple of environment variables. I use full path names whenever possible because I don’t know where I will be in the file system, and the VUFIND_HOME environment variable sometimes gets in the way. Ironic.
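The two environment variables look something like the following; the values are hypothetical and change from repository to repository:

```shell
# Hypothetical initialization; NAME and FILE vary with each repository.
NAME='repository_name'                                    # the oai.ini section to harvest
FILE="/usr/local/vufind2/local/harvest/$NAME/sample.xml"  # a harvested sample metadata file
```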
# configure; first the name of the repository and then a sample metadata file
rm -rf /usr/local/vufind2/local/harvest/$NAME/*.delete
rm -rf /usr/local/vufind2/local/harvest/$NAME/*
# delete; an unfinished homemade Perl script to remove content from Solr
# harvest; do the first part of the work
cd /usr/local/vufind2/harvest/; php harvest_oai.php $NAME
# test XSL output; $FILE is the sample metadata file configured above
cd /usr/local/vufind2/import; \
php ./import-xsl.php --test-only $FILE $NAME.properties
# index; do the second part of the work
/usr/local/vufind2/harvest/batch-import-xsl.sh $NAME $NAME.properties
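For what it is worth, an oai.ini stanza for a single repository looks something like the following sketch; the section name, URL, and set name are made up, and the inject directives are among the configuration options alluded to above:

```ini
; a hypothetical oai.ini section; one section per repository
[repository_name]
url = http://example.org/oai/request
metadataPrefix = oai_dc
set = some_set
injectId = identifier    ; copy the OAI identifier into each harvested record
injectDate = datestamp   ; copy the OAI datestamp into each harvested record
```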
Using the recipe and these ingredients, I am usually able to harvest and index content from a new repository in a few hours. Of course, it all depends on the number of sets in the repository, the number of items in each set, as well as the integrity of the metadata itself.
As I have alluded to in a previous blog posting, the harvesting and indexing of OAI content is not straightforward. In my particular case, the software is not to blame. No, the software is very well written. I do not take advantage of all of the software’s features, but that is only because I do not desire to introduce any “-isms” into my local implementation. Specifically, I do not desire to mix PHP code with my XSL routines. Doing so seems too much like fusion cuisine.
The challenge in this process is both the way Dublin Core is used, as well as the data itself. For example, is a PDF document a type of text? Sometimes it is denoted that way. There are dates in the metadata, but the dates are not qualified. Date published? Date created? Date updated? Moreover, the dates are syntactically different: 1995, 1995-01-12, January 1995. My software is stupid and/or I don’t have the time to normalize everything for each and every set. Then there are subjects. Sometimes they are Library of Congress headings. Sometimes they are just keywords. Sometimes there are multiple subjects in the metadata, and they are enumerated in one field delimited by various characters. Sometimes these multiple subject “headings” are manifested as multiple dc.subject elements. Authors (creators) present a problem. First name last? Last name first? Complete with birth and death dates? Identifiers? Ack! Sometimes they include unique codes — things akin to URIs. Cool! Sometimes identifiers are URLs, but most of the time, these URLs point to splash pages of content management systems. Rarely do the identifiers point to the item actually described by the metadata. And then there are out-and-out errors. For example, description elements containing URLs pointing to image files.
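To give a flavor of the normalization problem, here is the sort of quick-and-dirty shell routine I have in mind; the function names are my own inventions, and the date routine rescues nothing more than a year:

```shell
# Quick-and-dirty metadata normalization sketches (function names are hypothetical).

normalize_date() {            # pull the first 4-digit year out of a dc.date value
  echo "$1" | grep -oE '[0-9]{4}' | head -n 1
}

split_subjects() {            # explode a delimited dc.subject field, one heading per line
  echo "$1" | tr ';' '\n' | sed 's/^ *//; s/ *$//'
}

normalize_date '1995'           # -> 1995
normalize_date '1995-01-12'     # -> 1995
normalize_date 'January 1995'   # -> 1995
split_subjects 'Catholicism; Monasticism; Liturgy'
```

Routines like these are exactly the set-specific logic that ends up buried in each repository’s XSL.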
Actually, none of this is new. Diane Hillmann & friends encountered all of these problems on a much grander scale through the National Science Foundation’s desire to create a “digital library”. Diane’s entire blog — Metadata Matters — is a cookbook for resolving these issues, but in my way of boiling everything down to its essentials, the solution is two-fold: 1) mutual agreements on how to manifest metadata, and 2) the writing of more intelligent software on my part.