This posting describes how EAD (Encoded Archival Description) files are indexed in the “Catholic Portal” with VuFind.
VuFind is a “next-generation library catalog” or “discovery system” application. Its primary purpose is to index bibliographic metadata and provide a reader-friendly interface to the result. The heart of this process is a Solr index made up of many bibliographic-like fields. These fields are the usual suspects, including a host of variants on author, title, institution, building, collection, language, format, physical description, publisher, published date, edition, description (note), contents, URL, call number, ISSN, ISBN, OCLC number, series, topic, genre, geographic, era, illustration, full text, and record type. For EAD files to be searchable in the Portal, their metadata must be extracted, mapped to Solr fields, and added to the index. The balance of this posting describes this in more detail.
Before any indexing can take place, bits of pre-processing are applied against the EAD files. In a nutshell, this pre-processing (and the Perl scripts doing the work) includes:
- harvesting the EAD files from a remote HTTP server and caching them locally (ead-harvest.pl) – Done so the balance of the work can be done.
- validating the EAD files against the DTD and/or schema (ead-validate.pl) – Done because we don’t want to practice GIGO (Garbage In, Garbage Out).
- adding unique identifiers to each did-level element of the EAD files (ead-transform.pl) – The Solr indexer requires a unique identifier for each indexed item. This process provides the identifiers and also makes it easy to hyperlink directly to a place in the EAD through the use of HTML anchors.
- transforming the EAD files into HTML and making the results Web accessible (ead-transform.pl) – Done because links to remote versions of the EAD files break, and humans do not read XML very well.
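The identifier step can be illustrated with a minimal sketch. It is written in Python rather than Perl and is not the code in ead-transform.pl. The function name `add_identifiers`, the `prefix` parameter, and the identifier format are hypothetical, and the sketch assumes a DTD-style EAD file without XML namespaces.

```python
import xml.etree.ElementTree as ET


def add_identifiers(ead_xml: str, prefix: str) -> str:
    """Give the unitid of every did element a unique id attribute.

    Assumptions (not from the original scripts): identifiers look like
    "<prefix>-0001", and an empty unitid is created when a did has none.
    """
    root = ET.fromstring(ead_xml)
    # Collect the did elements first so the tree is not modified
    # while it is being iterated over.
    dids = list(root.iter("did"))
    for count, did in enumerate(dids, start=1):
        unitid = did.find("unitid")
        if unitid is None:
            unitid = ET.SubElement(did, "unitid")
        # The same value can later serve as an HTML anchor name.
        unitid.set("id", f"{prefix}-{count:04d}")
    return ET.tostring(root, encoding="unicode")
```

Because the elements are numbered in document order, rerunning the sketch against an unchanged file yields the same identifiers, which keeps links to the HTML anchors stable.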
The bulk of the indexing process centers around the acquisition of metadata, and it is completely handled by a Perl script named ead-index.pl:
- The process begins by looking up the name of the institution and the name of the library where the EAD file was created. These values are stored in a rudimentary tab-delimited database.
- Next, the value for record type is set. It is always “EAD”.
- Third, a value for format is set. It is always “Archival material”.
- Next, the language of the material is extracted from the /ead/archdesc/did/langmaterial/language element. If no language is specified, then language is set to “Unknown”.
- Each did-level element from the EAD file is then examined, pulling out its unique identifier (the id attribute of the unitid element created in Step #3 of pre-processing), title (the unittitle element), and date (the unitdate element). The title metadata is a bit special since it is really a concatenation of the titles of all the given did element's ancestors with its own title. This is done because each item in an EAD file is a part of the entire collection, and this enhanced title is intended to provide context.
- At this point the metadata for each did-level element has been extracted and is mapped to a select number of Solr fields, namely:
  - id -> unique identifier
  - title -> title
  - title_auth -> title
  - title_full -> title
  - title_fullStr -> title
  - title_full_unstemmed -> title
  - title_short -> title
  - title_sort -> title
  - publishDate -> date
  - format -> always “Archival material”, from Step #3
  - institution -> the name of the library’s hosting institution, from Step #1
  - building -> the name of the library, from Step #1
  - fullrecord -> an XML snippet containing the unique identifier, title, date, as well as two URLs pointing to HTML versions (local and remote) of the EAD file
  - recordtype -> always “EAD”, from Step #2
  - language -> language, from Step #4
- Finally, the metadata is added to VuFind’s underlying Solr index.
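The steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the code in ead-index.pl. The function names, the " / " title separator, the element names inside the fullrecord snippet, and the URL parameters are all assumptions, and the EAD is assumed to have no XML namespace. The institution and library names are passed in rather than looked up, because the layout of the tab-delimited database is not described here. The last function builds a Solr XML update message but does not send it anywhere.

```python
import xml.etree.ElementTree as ET

# The seven title fields all receive the same concatenated title.
TITLE_FIELDS = ["title", "title_auth", "title_full", "title_fullStr",
                "title_full_unstemmed", "title_short", "title_sort"]


def child_text(parent, tag):
    """Whitespace-normalized text of the first matching child, or ''."""
    child = parent.find(tag) if parent is not None else None
    if child is None:
        return ""
    return " ".join("".join(child.itertext()).split())


def full_record(uid, title, date, local_url, remote_url):
    """XML snippet for fullrecord (element names are hypothetical)."""
    record = ET.Element("record")
    for name, value in [("id", uid), ("title", title), ("date", date),
                        ("local", f"{local_url}#{uid}"), ("remote", remote_url)]:
        ET.SubElement(record, name).text = value
    return ET.tostring(record, encoding="unicode")


def extract_docs(ead_xml, institution, building, local_url, remote_url, sep=" / "):
    """Return one Solr-ready dict per did-level element."""
    root = ET.fromstring(ead_xml)
    langmaterial = root.find("archdesc/did/langmaterial")
    language = child_text(langmaterial, "language") or "Unknown"
    docs = []

    def walk(element, parent_titles):
        titles = parent_titles
        did = element.find("did")
        if did is not None:
            title = child_text(did, "unittitle")
            if title:
                titles = parent_titles + [title]
            unitid = did.find("unitid")
            uid = unitid.get("id") if unitid is not None else None
            if uid:  # skip items that pre-processing did not identify
                full_title = sep.join(titles)
                date = child_text(did, "unitdate")
                doc = {field: full_title for field in TITLE_FIELDS}
                doc.update({
                    "id": uid,
                    "publishDate": date,
                    "format": "Archival material",
                    "institution": institution,
                    "building": building,
                    "fullrecord": full_record(uid, full_title, date,
                                              local_url, remote_url),
                    "recordtype": "EAD",
                    "language": language,
                })
                docs.append(doc)
        # Descend into nested components, passing the titles seen so far.
        for child in element:
            if child.tag != "did":
                walk(child, titles)

    walk(root.find("archdesc"), [])
    return docs


def to_solr_add_xml(docs):
    """Serialize the dicts as a Solr XML <add> message, omitting empty values."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            if value:
                ET.SubElement(doc_el, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")
```

Passing the accumulated titles down the recursion is what produces the context-bearing title: an item three levels deep ends up with the collection title, its series title, and its own title joined together.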
The indexing process is far from perfect. First, in the current process, the entire head element of the EAD file is ignored. While it contains very rich metadata, such as controlled vocabulary terms and abstracts, these values describe the collection as a whole and do not necessarily apply to each individual did-level element.
Second, creating EAD files is laborious in the first place. There are not enough resources in most archival departments to describe did-level elements with much more detail than title and date. It would be nice to have a narrative summary describing each did-level element, a more specific format, some keywords or controlled vocabulary terms, a consistently formatted date, etc. But again, creating such metadata for each did-level element is expensive. Consequently, indexed items are not described as robustly as possible.
Third, while VuFind’s implementation of Solr is bibliographic in nature, it is heavily weighted towards bibliographic metadata describing books: OCLC number, call number, ISBN and ISSN, edition, etc. There are no fields for EAD-specific things such as postal addresses, provenance, or biographies.
Again, the process is not perfect, but it does enable the Catholic Research Resources Alliance to amalgamate the metadata of its member institutions and provide a searchable index to the result. Suggestions for improvement are welcome.