This posting outlines how I plan to prepare EAD files for indexing with Solr, the underlying indexing technology of VUFind.
I am aggregating sets of EAD files from Catholic Research Resource Alliance members, and I am expected to index these files at the most granular level possible, that is, at the did level. To satisfy both human and computer requirements, each indexed record needs at least a unique identifier, a human-readable descriptor, and a location code. The unique identifier can be taken from the unitid element. The human-readable descriptor can come from the unittitle element. The location code can be inferred from the url attribute of the eadid element.
Unfortunately, not all of the aggregated EAD files include a unitid, and when they do, the values are not always unique. Additionally, the hierarchical nature of EAD files makes the values extracted from unittitle elements almost meaningless unless they are placed within the context of their parent unittitle values. In short, indexing EAD files without some preprocessing makes the indexing process all but useless. What to do?
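To make the problem concrete, here is a minimal sketch that audits one EAD file for did elements lacking a unitid and for unitid values used more than once. It is written in Python with the standard library rather than XSLT, and it is only an illustration. The sample document and the function name audit_unitids are hypothetical, and the sketch assumes a DTD-style EAD file with no XML namespace.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; real EAD files are far larger and may use a namespace.
SAMPLE = """<ead>
  <archdesc>
    <did><unitid>A1</unitid><unittitle>Papers</unittitle></did>
    <dsc>
      <c01>
        <did><unitid>A1</unitid><unittitle>Correspondence</unittitle></did>
        <c02>
          <did><unittitle>1901-1910</unittitle></did>
        </c02>
      </c01>
    </dsc>
  </archdesc>
</ead>"""


def audit_unitids(root):
    """Return (count of did elements without a unitid, sorted duplicated values)."""
    missing = 0
    seen = Counter()
    for did in root.iter("did"):
        unitid = did.find("unitid")
        text = (unitid.text or "").strip() if unitid is not None else ""
        if text:
            seen[text] += 1
        else:
            missing += 1
    duplicates = sorted(value for value, count in seen.items() if count > 1)
    return missing, duplicates


missing, duplicates = audit_unitids(ET.fromstring(SAMPLE))
print(missing, duplicates)  # prints: 1 ['A1']
```

A file that reports anything other than zero missing and no duplicates cannot supply reliable record keys as it stands.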
The solution includes: 1) adding and/or normalizing the unitid values, 2) constructing a more complete “title” based on previously enumerated unittitle values, and 3) outputting the whole thing to an XML stream easily indexable by Solr.
Adding and/or normalizing the unitid values (Step #1) can be accomplished with a stylesheet called addunitid.xsl. Essentially an identity transformation, the stylesheet loops through an EAD file using the generate-id() function to create or replace unitid values. The result is an enhanced EAD file.
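The idea behind Step #1 can be sketched outside of XSLT as well. The Python below is not addunitid.xsl. It is a rough standard-library equivalent that copies the tree unchanged except for unitid, which it creates or overwrites on every did. A simple counter stands in for generate-id(), and the function name add_unitids, the prefix parameter, and the sample document are all assumptions made for illustration. It also assumes an EAD file without an XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one duplicated unitid and one did with no unitid at all.
SAMPLE = """<ead>
  <archdesc>
    <did><unitid>A1</unitid><unittitle>Papers</unittitle></did>
    <dsc>
      <c01>
        <did><unitid>A1</unitid><unittitle>Correspondence</unittitle></did>
        <c02>
          <did><unittitle>1901-1910</unittitle></did>
        </c02>
      </c01>
    </dsc>
  </archdesc>
</ead>"""


def add_unitids(root, prefix="id"):
    """Create or replace the unitid of every did so each value is unique in the file."""
    # Materialize the list first: the tree must not be modified while iter() walks it.
    for number, did in enumerate(list(root.iter("did")), start=1):
        unitid = did.find("unitid")
        if unitid is None:
            unitid = ET.Element("unitid")
            did.insert(0, unitid)
        # The counter plays the role that generate-id() plays in the stylesheet.
        unitid.text = f"{prefix}{number}"
    return root


root = add_unitids(ET.fromstring(SAMPLE))
ids = [did.findtext("unitid") for did in root.iter("did")]
print(ids)  # prints: ['id1', 'id2', 'id3']
```

Identifiers produced this way are unique only within a single file, so an aggregated index would likely also need something file-specific in each key, which is what the hypothetical prefix parameter is for.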
Constructing more complete “titles” and outputting XML streams (Steps #2 and #3) is done by looping through each did element, extracting the necessary metadata, creating a record describing each did-level element, and sending to STDOUT a rudimentary XML stream of my own design. The heart of this second stylesheet (ead2solr.xsl) is the did/unittitle selector used to find all the parent unittitle values of a given element.
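Steps #2 and #3 can be sketched the same way. The Python below is not ead2solr.xsl. It walks the tree from the top, carries the unittitle values of the enclosing levels along with it, and emits one record per did. The record layout (records, record, id, title, location), the " / " separator, the example location URL, and the function names are assumptions for illustration and are not the stream format the stylesheet produces. The sample assumes unitid values have already been normalized and that the file has no XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with already-normalized unitid values.
SAMPLE = """<ead>
  <archdesc>
    <did><unitid>id1</unitid><unittitle>Papers</unittitle></did>
    <dsc>
      <c01>
        <did><unitid>id2</unitid><unittitle>Correspondence</unittitle></did>
        <c02>
          <did><unitid>id3</unitid><unittitle>1901-1910</unittitle></did>
        </c02>
      </c01>
    </dsc>
  </archdesc>
</ead>"""


def collect_records(element, parent_titles=()):
    """Yield (unitid, contextual title) pairs, one for each did in document order."""
    titles = tuple(parent_titles)
    did = element.find("did")  # direct child only, so each level is handled once
    if did is not None:
        unittitle = did.find("unittitle")
        # itertext() flattens any markup nested inside unittitle.
        text = " ".join("".join(unittitle.itertext()).split()) if unittitle is not None else ""
        titles = titles + (text,)
        identifier = did.findtext("unitid", default="").strip()
        yield identifier, " / ".join(t for t in titles if t)
    for child in element:
        if child.tag != "did":
            yield from collect_records(child, titles)


def to_xml(records, location):
    """Serialize records into a simple, made-up XML stream."""
    root = ET.Element("records")
    for identifier, title in records:
        record = ET.SubElement(root, "record")
        ET.SubElement(record, "id").text = identifier
        ET.SubElement(record, "title").text = title
        ET.SubElement(record, "location").text = location
    return ET.tostring(root, encoding="unicode")


records = list(collect_records(ET.fromstring(SAMPLE)))
print(records[-1])  # prints: ('id3', 'Papers / Correspondence / 1901-1910')
print(to_xml(records, "http://example.org/findingaid.xml"))
```

With the parent titles prepended, a bare "1901-1910" becomes a title a person can interpret in a search result.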
Finally, I wrote a simple shell script (clean.sh) that makes it easy to run the above transformations from the command line.
(I would not have been able to do this work if it weren’t for the XML4Lib mailing list and a few fine respondents to my pleas for help. Thanks go to MJ Suhonos, Tod Olson, Stefan Krause, and Alexander Johannesen. “Thank you!”)
Software is never done. If it were, then it would be called hardware. Therefore next steps include:
- automatically adding the modified EAD files (the output of the first stylesheet) to Archon
- enhancing the output of the second stylesheet with scope notes, abstracts, etc.
- indexing the output of the second stylesheet
Fun with XSLT?