Indexing MARC and EAD in VUFind with Solr for the CRRA

This posting outlines how I am currently indexing MARC and EAD files in VUFind with Solr for the CRRA. (Boy, there are a lot of acronyms in that sentence!)

Background

The Catholic Research Resources Alliance (CRRA) is a member-driven organization with the purpose of making available “rare, unique, and uncommon” research materials for Catholic scholarship. Presently the membership is primarily made up of libraries and archives who pool together their metadata records, have them indexed, and provide access to the index. My responsibility is to build and maintain the technical infrastructure supporting this endeavor.

EAD @ Marquette 4 CRRA

This is the briefest of travelogues reporting on a meeting about EAD files at Marquette University for the Catholic Research Resources Alliance on September 20, 2010.

A few members of the Alliance were awarded a CLIR grant to catalog previously uncataloged special collections items. These members are now doing the work using EAD (Encoded Archival Description) with the intent of sharing the resulting metadata with the “Catholic Portal”. The purposes of the meeting were to build relationships between these particular Alliance members and to discuss progress on the grant. In attendance were people from St. Catherine University (Deborah Kloiber and Emily Asch), Marquette University (Matt Blessing, Ann Hanlon, Bill Fliss, and Jean Zanoni), and the University of Notre Dame (Pat Lawton, Kevin Cawley, and Eric Lease Morgan).

index-ead.pl

Today I indexed some of the metadata I extracted yesterday using a script called index-ead.pl. Of all the scripts I’ve written so far, this one is the most straightforward. Read a locally developed XML file. Extract the unique identifier, title, and date. Associate each with VUFind/Solr fields. Commit.
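
For readers who want to see the shape of that read/extract/associate/commit loop, here is a minimal sketch in Python. It is not index-ead.pl itself. The layout of the local XML file (records, record, id, title, date) and the index field names in FIELD_MAP are assumptions made up for illustration. The only thing taken as given is Solr's XML update format, where documents are wrapped in add/doc/field elements and followed by a commit message.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the locally developed XML file; the element
# names below are assumptions, not the format index-ead.pl actually reads.
LOCAL_XML = """<records>
  <record>
    <id>abc-0001</id>
    <title>Correspondence, 1901-1910</title>
    <date>1901-1910</date>
  </record>
  <record>
    <id>abc-0002</id>
    <title>Photographs</title>
    <date>1923</date>
  </record>
</records>"""

# Assumed mapping from local element names to index field names.
FIELD_MAP = {"id": "id", "title": "title", "date": "publishDate"}


def build_solr_add(local_xml: str) -> str:
    """Return a Solr XML <add> message built from the local records."""
    add = ET.Element("add")
    for record in ET.fromstring(local_xml).iter("record"):
        doc = ET.SubElement(add, "doc")
        for local_name, solr_name in FIELD_MAP.items():
            value = record.findtext(local_name)
            if value:  # skip missing or empty elements
                field = ET.SubElement(doc, "field", name=solr_name)
                field.text = value.strip()
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # The add message, followed by a commit, would be posted to the
    # index's update handler; here both are simply printed.
    print(build_solr_add(LOCAL_XML))
    print("<commit/>")
```

Building the message as a string keeps the sketch independent of any particular Solr client library. Posting it to a running index is left out on purpose.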

You can (temporarily) see the fruits of these labors because all of the records have been associated with the Eric Lease Morgan Foo Bar Library. The result is a list of container-level records with very little additional information.

By the way, as of today I am running a version of VUFind as retrieved from the development trunk, specifically, revision 3029. When upgrading from revision to revision, it is important to retain one’s config.ini file and reindex. The process is not painful, if done infrequently. As time goes on I will also need to retain locally developed hacks, such as the ones I need to write below.

The next steps are to write the MARC record driver so it does not attempt to do automatic look-ups for call numbers, but rather extracts such information from the local index. A second next step is to write an EAD record driver to accommodate the special cases of… EAD records.

Adding unitid elements to did elements

This posting outlines how I believe I will add unitid elements to did elements of EAD files.

The problem

As the CRRA matures, I expect more of the metadata ingested into the “portal” will come from EAD files. In order to index EAD files meaningfully, I need to extract unique identifiers from each container-level element, a human-readable description of the container, and a location code. The identifier and human-readable description can easily come from unitid and unittitle elements of did elements.

Unfortunately, unitid (and maybe unittitle) are not required elements of did elements. While the CRRA could mandate the creation of such elements, it turns out to be almost just as easy to create them on-the-fly.
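
To make "on-the-fly" concrete, here is a minimal sketch in Python of one way it could be done. It walks every did element and, where no unitid child exists, inserts one with a generated value. The identifier scheme (a prefix plus a running counter), the function name add_unitids, and the tiny sample document are all assumptions for illustration. The sketch also assumes the EAD file uses no XML namespace, so a namespaced file would need qualified element names.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up EAD fragment: the collection-level did has a unitid,
# the two container-level did elements do not.
EAD = """<ead>
  <archdesc>
    <did><unitid>ABC</unitid><unittitle>Papers</unittitle></did>
    <dsc>
      <c01><did><unittitle>Correspondence</unittitle></did></c01>
      <c01><did><unittitle>Photographs</unittitle></did></c01>
    </dsc>
  </archdesc>
</ead>"""


def add_unitids(ead_xml: str, prefix: str) -> str:
    """Give every did element lacking a unitid a generated one."""
    root = ET.fromstring(ead_xml)
    counter = 0
    # Materialize the list first so the tree is not modified mid-iteration.
    for did in list(root.iter("did")):
        if did.find("unitid") is None:
            counter += 1
            unitid = ET.Element("unitid")
            # Hypothetical scheme: prefix plus a zero-padded counter.
            unitid.text = f"{prefix}-{counter:04d}"
            did.insert(0, unitid)  # make it the first child of did
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(add_unitids(EAD, "xyz"))
```

A counter-based identifier is only stable if the file's element order never changes between runs. A scheme derived from the container's position or content may hold up better over repeated harvests.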

Harvesting, updating, and re-indexing

This posting describes the automated process I am currently using to harvest, update, and re-index the MARC records of the “Catholic Portal”.

Step #1 – Make a list

Librarians love lists, and I am no exception. The process begins with a list (a database) of CRRA members who have MARC metadata to share. Each item in the list includes the following fields:

  1. code – a unique three-letter identifier
  2. institution – the name of the CRRA member
  3. library – the name of the member’s library
  4. URL – the location of the member’s MARC records

Right now, the name of this list is libraries.db. It is created by hand.
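
As an illustration of how such a list might be read by a harvesting script, here is a minimal sketch in Python. The on-disk format of libraries.db is not described above, so the sketch assumes one member per line with four tab-delimited fields in the order listed. The sample rows, the Member type, and the read_members function are hypothetical.

```python
import csv
import io
from typing import NamedTuple


class Member(NamedTuple):
    code: str         # unique three-letter identifier
    institution: str  # name of the CRRA member
    library: str      # name of the member's library
    url: str          # location of the member's MARC records


# Made-up rows standing in for the contents of libraries.db.
SAMPLE = (
    "foo\tExample University\tExample Library\thttp://example.org/foo.mrc\n"
    "bar\tSample College\tSample Archives\thttp://example.org/bar.mrc\n"
)


def read_members(handle) -> list[Member]:
    """Parse tab-delimited rows into Member records, validating each."""
    members = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:  # tolerate blank lines in a hand-made file
            continue
        if len(row) != 4:
            raise ValueError(f"expected 4 fields, got {len(row)}: {row!r}")
        member = Member(*(value.strip() for value in row))
        if len(member.code) != 3:
            raise ValueError(f"code must be three letters: {member.code!r}")
        members.append(member)
    return members


if __name__ == "__main__":
    for member in read_members(io.StringIO(SAMPLE)):
        print(member.code, member.url)
```

Because the list is created by hand, the checks on field count and code length are there to catch typing mistakes before any harvesting begins.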
