Indexing and displaying Encoded Archival Description files

This posting muses on how to index and display Encoded Archival Description (EAD) files in the “Catholic Portal” of the Catholic Research Resources Alliance.

The Catholic Portal is essentially an index of two types of metadata: 1) records describing individual and discrete items, and 2) records describing collections of individual items. For the most part, the former metadata records are MARC records describing books. The latter are EAD files describing the holdings of archives.

Here at “Catholic Portal Central” we use an application called Vufind to index our metadata. At its core is the current gold standard of indexers — Solr. A set of PHP scripts is then wrapped around Solr to support querying the index, displaying search results, and providing services (tagging, emailing, saving, etc.) against them. Vufind is/was designed to be used by libraries as a “discovery system” — a tool to support searching of the venerable library catalog.

Initially the Portal’s index included individual discrete items (books), and Vufind fit the bill pretty well. As both the membership and scope of the Alliance evolved, an increasing amount of the metadata to index was archival in nature, namely EAD files. Being a librarian by profession and not an archivist, I did not originally appreciate the hierarchical nature of EAD files.

In order to facilitate discovery, I originally parsed only the header of my EAD files, complete with all of its rich controlled vocabulary terms. While this indexing process worked, it missed all of the individual items in the body of the EAD file. “We need full text indexing,” I was told. “We need to have access to the entirety of the EAD file in order to appreciate the collection as a whole.” Consequently, I changed my routine to index each did-level item only, even though each did-level item contained only the tiniest bit of metadata. Using this second technique it was now possible to find a letter written by Dorothy Day in a Graham Greene collection. On the other hand, because of the meager metadata, search results were flooded with items of little or no context. The hierarchical nature of the EAD file had been lost in the display, and the context went with it.
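To make the two granularities concrete, here is a minimal sketch in Python. It is not the Portal's actual indexing routine. It assumes a DTD-style EAD file without XML namespaces, the sample document is invented for illustration, and the function names (`header_record`, `item_records`) and dictionary keys are hypothetical rather than fields from Vufind's Solr schema. It builds one record from the header and one record per did-level component, and each component record carries a "context" trail of its ancestors' titles so the hierarchy is not lost.

```python
import xml.etree.ElementTree as ET

# Invented sample; a real EAD file is far richer and may use namespaces.
SAMPLE_EAD = """<ead>
  <eadheader>
    <eadid>example-001</eadid>
    <filedesc><titlestmt><titleproper>Graham Greene Papers</titleproper></titlestmt></filedesc>
  </eadheader>
  <archdesc level="collection">
    <did><unittitle>Graham Greene Papers</unittitle></did>
    <controlaccess><persname>Greene, Graham</persname><subject>Catholic authors</subject></controlaccess>
    <dsc>
      <c01 level="series">
        <did><unittitle>Correspondence</unittitle></did>
        <c02 level="file">
          <did><unittitle>Letter from Dorothy Day</unittitle></did>
        </c02>
      </c01>
    </dsc>
  </archdesc>
</ead>"""


def text_of(element):
    """Return all text inside an element, whitespace-normalized ('' if missing)."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def is_component(tag):
    """True for <c> and numbered components such as <c01> ... <c12>."""
    return tag == "c" or (tag.startswith("c") and tag[1:].isdigit())


def header_record(root):
    """Build one collection-level record from the header and controlled terms."""
    return {
        "id": text_of(root.find("./eadheader/eadid")),
        "type": "collection",
        "title": text_of(root.find("./eadheader/filedesc/titlestmt/titleproper")),
        "terms": [text_of(t) for t in root.findall("./archdesc/controlaccess/*")],
    }


def item_records(root, collection_id):
    """Build one record per component, keeping the trail of ancestor titles."""
    records = []

    def walk(element, trail):
        did = element.find("did")
        title = text_of(did.find("unittitle")) if did is not None else ""
        if is_component(element.tag) and did is not None:
            records.append({
                "id": f"{collection_id}-{len(records) + 1}",
                "type": "item",
                "title": title,
                "context": " > ".join(trail),
                "collection": collection_id,
            })
        here = trail + [title] if title else trail
        for child in element:
            if child.tag == "dsc" or is_component(child.tag):
                walk(child, here)

    walk(root.find("archdesc"), [])
    return records


root = ET.fromstring(SAMPLE_EAD)
collection = header_record(root)
items = item_records(root, collection["id"])
for record in [collection] + items:
    print(record)
```

With a trail like this stored alongside each item, a hit on the Dorothy Day letter could be displayed together with the series and collection it belongs to, rather than as a bare title.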

“We like the way other indexes of EAD files work. Go look at them.” So I created a sample of EAD file indexing sites, and then I did a bit of compare & contrast. Here are the sites in my sample:

With only slight variations, each index functions similarly:

  1. enter a search term
  2. get back a list of EAD headers
  3. sometimes get back snippets containing search terms
  4. other times get back a link to snippets containing search terms
  5. get back a link to view a navigable EAD in its entirety

Assuming the functionality above is the current best practice, it should not be too difficult to implement most of it. To do so I will need to change my indexing routine to include both the EAD header and each of the did-level elements. This alone should improve recall since just about none of the EAD files’ headers are presently being indexed. After indexing I will need to change the EAD record driver to display additional fields, and the result should be a more complete display. Displaying snippets will be a challenge since the full text of the EAD file is not retained in the underlying Solr index and therefore not displayable, but I may be able to exploit my Perl-based concordancing modules to address this problem. (“When you have a hammer, everything begins to look like a nail.”) Finally, linking to a navigable EAD in its entirety is already implemented in the Portal. Only small tweaks should be necessary.
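The snippet idea can be sketched as a tiny keyword-in-context routine that works directly on the EAD file instead of on the Solr index. This is a Python illustration under stated assumptions, not the Perl concordancing modules mentioned above. The function names (`plain_text`, `snippets`), the `width` and `limit` parameters, and the sample XML are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def plain_text(ead_xml):
    """Strip the markup from an EAD document and normalize whitespace."""
    root = ET.fromstring(ead_xml)
    return " ".join(" ".join(root.itertext()).split())


def snippets(text, term, width=30, limit=3):
    """Return up to `limit` windows of `width` characters around each match."""
    results = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        results.append(text[start:end])
        if len(results) == limit:
            break
    return results


# Invented sample standing in for the full EAD file on disk.
sample = (
    "<ead><archdesc><did><unittitle>Graham Greene Papers</unittitle></did>"
    "<dsc><c01><did><unittitle>Letter from Dorothy Day to Graham Greene"
    "</unittitle></did></c01></dsc></archdesc></ead>"
)
text = plain_text(sample)
for snippet in snippets(text, "dorothy day", width=12):
    print("..." + snippet + "...")
```

Because the windows are cut from the original file at display time, nothing extra needs to be stored in the index. The cost is re-reading and re-parsing the EAD file for every search result shown.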

How long should all of this take? I’m not sure. I’m guessing a couple of months. Wish me luck. On my mark. Get set. Go.

Author: Eric Lease Morgan

I am a librarian first and a computer user second. My professional goal is to discover new ways to use computers to provide better library services. I use much of my time here at the University of Notre Dame developing and providing technical support for the Catholic Research Resources Alliance -- the "Catholic Portal".