How to make MARC and EAD metadata available in the “Catholic Portal”

February 22nd, 2012

This is a set of (draft) prescriptive instructions describing how to make MARC and EAD metadata available in the “Catholic Portal”.

Introduction

At its core, the “Portal” is an index — a list of pointers to content items. Access to this index is implemented through a form-based interface. Readers enter queries into the form, and items are returned. Readers are then expected to select items of interest from the returned list, and use them for the purposes of research and scholarship. In order to implement this functionality, each content item in the index requires, at the very least, three elements: 1) a unique identifier, 2) a human-readable description of the item, and 3) a location code where the item can be acquired.

The MARC and EAD metadata schemes are well-suited for indexing. After making sets of MARC records and/or EAD files transparently accessible on a Web server, it is easy to harvest the metadata, integrate it into the Portal’s index, and provide access to the content items.

The balance of this posting describes how to make MARC and EAD files available for harvesting.

MARC

Here’s the short version. Export all the MARC records from your integrated library system you think are apropos to the “Catholic Portal” making sure they are encoded using the UTF-8 character set. Save the resulting file on a Web server, and tell Eric Morgan the URL of the resulting file. Eric will do the rest.

Here’s the long version. Remember, every record in the Portal needs a unique identifier, a human-readable description, and a location code. For MARC records, this means every record first needs a value in the 001 field. Any value will do as long as it is unique to your set of records. Second, each MARC record needs something in the 245 field. At the very least this will be the human-readable description. All the other descriptive and analytic fields will supplement this description. Third, each MARC record needs to have a location code, and this is the item’s call number. This value will most likely be extracted from the 090 field.
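The three requirements above can be checked before a file is handed over. The following is a minimal sketch, not the Portal's actual tooling: it reads the directory of a single ISO 2709 (MARC) record and reports which of the 001, 245, and 090 tags are absent. The build_record helper and the identifier und_000001 are hypothetical and exist only to make the example self-contained.

```python
FIELD_END = b"\x1e"    # ISO 2709 field terminator
RECORD_END = b"\x1d"   # ISO 2709 record terminator
REQUIRED_TAGS = ("001", "245", "090")  # identifier, description, call number


def build_record(fields):
    """Assemble a tiny ISO 2709 record from (tag, bytes) pairs; demonstration only."""
    directory, data = b"", b""
    for tag, value in fields:
        body = value + FIELD_END
        # Each directory entry: tag (3), field length (4), start offset (5).
        directory += f"{tag}{len(body):04d}{len(data):05d}".encode("ascii")
        data += body
    base = 24 + len(directory) + 1   # leader + directory + its terminator
    total = base + len(data) + 1     # plus the record terminator
    leader = f"{total:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + FIELD_END + data + RECORD_END


def missing_tags(record):
    """Return the required tags that do not appear in the record's directory."""
    base = int(record[12:17].decode("ascii"))  # leader positions 12-16
    directory = record[24:base - 1]            # 12-byte entries, terminator dropped
    present = {directory[i:i + 3].decode("ascii")
               for i in range(0, len(directory), 12)}
    return [tag for tag in REQUIRED_TAGS if tag not in present]


# A hypothetical record with an identifier and a title but no call number.
sample = build_record([
    ("001", b"und_000001"),
    ("245", b"10\x1faLife of Mrs. Eliza A. Seton"),
])
print(missing_tags(sample))  # ['090']
```

A record that lists all three tags yields an empty list. The check only confirms that the fields exist, not that their values are sensible.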

Helping you decide which MARC records to extract from your integrated library system is beyond the scope of this document. But once you have figured that out it is recommended you denote which items are to be extracted by updating them with a local note. Here at the University of Notre Dame, we put the letters CRRA in field 590 subfield a. Once this is done it is relatively easy for the systems librarian to do a search for CRRA in field 590 subfield a, and dump the resulting records to a file. Alternatively, the systems librarian might search for all items whose call numbers begin with BX and dump the resulting set. The process you use to denote and export your MARC records depends on your local environment.

When exporting your MARC records from your integrated library system, it is imperative the records be encoded using the UTF-8 character set and not something else. The Portal’s underlying indexer does not deal very well with encodings of another kind. If your system does not export records as UTF-8, and it exports things in MARC-8 instead, then use an open source application called yaz-marcdump from Index Data to transform your records from one encoding into another. Once yaz-marcdump is installed you can execute a command like the following to do the transformation:

yaz-marcdump -f MARC-8 -t UTF-8 -o marc -l 9=97 input.mrc > output.mrc

The command translates MARC records from (-f) MARC-8 encoding to (-t) UTF-8 encoding. It outputs (-o) the result as MARC records, and inserts the letter a (ASCII character 97) into the leader (-l) at position 9. It uses the file named input.mrc as input, and it outputs the result to a file named output.mrc.
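After the conversion it can be reassuring to confirm that every record really is flagged and encoded as UTF-8. The sketch below is an assumption about how one might do that, not part of yaz-marcdump or the Portal: it splits a file on the record terminator, checks leader position 9 for the letter a, and tries to decode each record. The two sample records are toys with placeholder leaders.

```python
RECORD_END = b"\x1d"  # ISO 2709 record terminator


def suspect_records(data):
    """Return indexes of records not flagged as UTF-8 or not decodable as UTF-8."""
    suspects = []
    # In well-formed MARC the terminator byte never occurs inside a record.
    records = [chunk for chunk in data.split(RECORD_END) if chunk.strip()]
    for index, record in enumerate(records):
        flagged_utf8 = record[9:10] == b"a"  # leader position 9
        try:
            record.decode("utf-8")
            decodes = True
        except UnicodeDecodeError:
            decodes = False
        if not (flagged_utf8 and decodes):
            suspects.append(index)
    return suspects


# Toy records: the first is flagged and encoded as UTF-8, the second is neither.
good = b"00000nam a2200000 a 4500" + "caf\u00e9".encode("utf-8") + b"\x1e\x1d"
bad = b"00000nam  2200000 a 4500" + b"caf\xe9" + b"\x1e\x1d"
print(suspect_records(good + bad))  # [1]
```

Both tests are needed because a MARC-8 record containing only ASCII characters decodes cleanly as UTF-8, and only the leader flag gives it away.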

Every time you export your records, export everything you feel is relevant to the Portal. Do not worry about additions, changes, or deletions. We here at Portal Central handle this issue by deleting all of your records locally and re-indexing the whole lot.

After the records have been exported, save them on a Web server, and finally, tell Eric Morgan the URL of the resulting file. Please keep the URL the same from one export to the next. Eric will harvest the records and incorporate them into the index. As of this writing it is a good idea to tell Eric when new records are available, but at some point in time this won’t be necessary.

EAD

Here’s the short version. Use validated EAD files to encode the content you deem apropos to the Portal. Save all the EAD files in a single directory on a Web server making sure each file is given a .xml extension. Tell Eric Morgan the URL of the directory, and he will take care of the rest.

Here’s the longer version. Use whatever tool you desire to create EAD files describing the archival content you deem appropriate for the Portal. There are any number of available editors and applications facilitating this process. Make sure the resulting EAD files validate against the EAD DTD or schema. It doesn’t really matter which one, but right now validation against the DTD is easier to handle here at Portal Central.

Each did-level element in your EAD files will eventually become a record in the Portal’s index. During pre-processing here at Portal Central, a unique identifier (an id attribute on a unitid element) will be added to each did-level element that does not already have one. This pre-processing satisfies the need for unique identifiers. You need to do nothing with regard to unique identifiers.

Each did-level unittitle element will recursively be combined with its parent did/unittitle element to form a human-readable description of each content item. Consequently, there is nothing you need to do with regard to human-readable descriptions.
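To illustrate the recursive combination of titles, here is a small sketch. It assumes a DTD-style EAD file without XML namespaces, and the " / " separator and the sample finding aid are invented for the example. The Portal's own pre-processing may differ in its details.

```python
import xml.etree.ElementTree as ET

# A hypothetical, heavily abbreviated finding aid.
SAMPLE_EAD = """<ead><archdesc level="collection">
  <did><unittitle>Parish Records</unittitle></did>
  <dsc>
    <c01><did><unittitle>Correspondence</unittitle></did>
      <c02><did><unittitle>Letters, 1850-1860</unittitle></did></c02>
    </c01>
  </dsc>
</archdesc></ead>"""


def contextual_titles(element, inherited=()):
    """Yield one combined title per did, prefixed with its ancestors' titles."""
    titles = inherited
    did = element.find("did")
    if did is not None:
        unittitle = did.find("unittitle")
        text = ""
        if unittitle is not None:
            # Collapse whitespace and flatten any inline markup.
            text = " ".join("".join(unittitle.itertext()).split())
        if text:
            titles = inherited + (text,)
        yield " / ".join(titles)
    for child in element:
        if child.tag != "did":
            yield from contextual_titles(child, titles)


for title in contextual_titles(ET.fromstring(SAMPLE_EAD)):
    print(title)
# Parish Records
# Parish Records / Correspondence
# Parish Records / Correspondence / Letters, 1850-1860
```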

The location of items found in EAD files is facilitated in three ways. First, the name of your hosting institution and library/archive will be associated with each search result, so the need for location information will be satisfied, but only in a rudimentary way. Second, location information is reinforced through the url attribute of the eadid element. Specifically, you are expected to include a value in the url attribute of the eadid element, and this value is expected to point to a human-readable version of your EAD file on your Web server. Portal search results include hot links with a label similar to “View finding aid at owning institution”, and these hot links will be the same as the value in the url attribute. Your human-readable version of the EAD file is then expected to include instructions and contact information describing how to acquire items of interest. Finally, search results will include a second hot link with a label similar to “View finding aid in Portal display”. These hot links will point to a local HTML file transformed from the original EAD. Again, location and contact information should be a part of the HTML because it was a part of the original EAD.

In summary, create complete and valid EAD files making sure you include values in the url attributes of the eadid elements.
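A quick way to confirm that a file carries the expected url attribute is sketched below. It is only an illustration: the example.edu address is hypothetical, and the function compares local element names so that both DTD-based and namespaced (schema-based) files are handled.

```python
import xml.etree.ElementTree as ET


def eadid_url(ead_xml):
    """Return the url attribute of the first eadid element, or None if absent."""
    root = ET.fromstring(ead_xml)
    for element in root.iter():
        # Strip any "{namespace}" prefix before comparing the element name.
        if element.tag.rsplit("}", 1)[-1] == "eadid":
            return element.get("url")
    return None


with_url = (
    '<ead><eadheader>'
    '<eadid url="http://archives.example.edu/findingaids/abc.html">abc</eadid>'
    '</eadheader></ead>'
)
without_url = "<ead><eadheader><eadid>abc</eadid></eadheader></ead>"

print(eadid_url(with_url))     # http://archives.example.edu/findingaids/abc.html
print(eadid_url(without_url))  # None
```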

Once you have created your EAD files, save them in a single directory on a Web server, and tell Eric Morgan the URL of the directory. Make sure each EAD file ends with a .xml extension. Eric will then regularly harvest all the .xml files from your directory, re-validate them, make sure they include url attributes, add unique identifiers to each did-level element, and index each did-level element.
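The harvesting step depends on being able to discover every .xml file in the directory. One way this could work, assuming the Web server returns an HTML directory listing, is sketched here. The listing and the example.edu URL are made up, and the actual harvesting script may use a different method.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class XmlLinkParser(HTMLParser):
    """Collect absolute URLs of links whose targets end in .xml."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().endswith(".xml"):
            self.links.append(urljoin(self.base_url, href))


def xml_links(listing_html, base_url):
    """Return the .xml links found in a directory listing page."""
    parser = XmlLinkParser(base_url)
    parser.feed(listing_html)
    return parser.links


# A hypothetical directory listing as a Web server might render it.
listing = """<html><body>
<a href="../">Parent Directory</a>
<a href="abc.xml">abc.xml</a>
<a href="notes.txt">notes.txt</a>
<a href="def.xml">def.xml</a>
</body></html>"""

print(xml_links(listing, "http://archives.example.edu/ead/"))
```

Files without the .xml extension are skipped, which is why the extension matters.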

Philadelphia Archdiocesan Historical Research Center (PAHRC) records

February 7th, 2012

Just under 1,100 records from the Philadelphia Archdiocesan Historical Research Center (PAHRC) have been added to the “Portal”: http://bit.ly/uG92RG

Content from the University of Dayton

January 16th, 2012

Twenty-nine records from the Archives at the University of Dayton have been added to the “Catholic Portal”: http://bit.ly/weVl8h

Indexing PastPerfect metadata for the “Catholic Portal”

December 15th, 2011

Using VuFind’s inherent ability to index OAI metadata, I have successfully been able to index metadata coming from a PastPerfect implementation.

Starting somewhere near version 1.2, VuFind supports the indexing of arbitrary metadata types. Content from OAI repositories was the original example. Later, I figured out how to index EAD files. This was a breakthrough for the “Portal”. Give credit to open source software.

With the addition of the Philadelphia Archdiocesan Historical Research Center (PAHRC) into the Catholic Research Resources Alliance, a new metadata format needed to be accepted: metadata other than EAD or MARC. PAHRC uses “cataloging” software called PastPerfect. From what I can tell, it is a sophisticated FoxPro/Microsoft Access database application. It provides the means for institutions to do data entry, and have their holdings searched, and ultimately displayed on the Web.

PastPerfect can export its metadata in a form of Dublin Core. After working closely with Shawn Weldon, Faith Charlton (both of PAHRC), and Brian Gomez (Past Perfect, Inc), the metadata exported by PAHRC was tweaked to be less ambiguous and more accurate. Once this was done I was able to harvest the metadata, parse it into something usable by VuFind’s Solr indexer, and make it available through the Portal. I did this with a script called pastperfect-index.pl. The result is a set of searchable records from PAHRC.
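The parsing step might look something like the sketch below. It is not pastperfect-index.pl. The record layout, the pahrc_0001 identifier, and the choice of Dublin Core elements are assumptions, since the exact shape of the PastPerfect export is not shown here. The target names (id, title, publishDate) are fields of VuFind's Solr index.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core element namespace

# A hypothetical record; the real export may be structured differently.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>pahrc_0001</dc:identifier>
  <dc:title>Parish photograph album</dc:title>
  <dc:date>1910</dc:date>
</record>"""


def dublin_core_to_fields(record_xml):
    """Map the first identifier, title, and date of a record to Solr-style fields."""
    root = ET.fromstring(record_xml)

    def first(name):
        element = root.find(f".//{DC}{name}")
        if element is None or not element.text:
            return None
        return element.text.strip()

    return {
        "id": first("identifier"),
        "title": first("title"),
        "publishDate": first("date"),
    }


print(dublin_core_to_fields(SAMPLE))
```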

My current implementation is specific to PAHRC, and when other PastPerfect libraries/archives come on board, it will not be too difficult to abstract my implementation to support other institutions. That work is left to the future, when and if it occurs.

Fun with open source software!

Duplicate records in the “Catholic Portal”

December 9th, 2011

There is some concern about duplicate records in the “Catholic Portal”, and this posting introduces the topic to a wider audience.

The “Catholic Portal” is intended to contain links to and content of a rare and infrequently held nature. Every once in a while search results return duplicate records. For example, yesterday, it was brought to our attention that there are five records with the title Life Of Mrs. Eliza A. Seton. On one hand, few if any of these records are duplicates because between the five of them they are held by two different institutions, and each institution owns multiple editions. In the sense of a “catalog”, this is perfectly acceptable, if not expected. On the other hand, the Portal is not a catalog but rather an index, and each of the five items is really a variation on a theme. Should these records be merged?

Demian Katz shared with me and the Portal’s Digital Access Committee a query that can be applied to the Portal’s underlying Solr index, shown here with carriage returns added for readability:

http://localhost:8080/solr/biblio/select/?
q=*%3A*&rows=0&start=0&facet=true&facet.mincount=2&
facet.limit=-1&facet.field=oclc_num&facet.field=isbn

The result of this query is a list of OCLC and ISBN numbers which occur in the index at least two times. According to the result, which only matches on the OCLC or ISBN keys, there are no records in the index appearing more than three times. Furthermore, there are about 1,100 duplicated OCLC numbers and about 300 duplicated ISBN numbers. Considering the total number of records (93,000) in the index, this represents a total duplication rate of approximately 1.5%. Is this value too high?
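For anyone wanting to reproduce the query or the arithmetic, a short sketch follows. It rebuilds the same URL from its parameters and works out the duplication rate from the rounded counts quoted above. Note that the rate counts duplicated keys, not the number of records sharing them.

```python
from urllib.parse import urlencode

# The parameters of the facet query, in the order shown above.
params = [
    ("q", "*:*"),
    ("rows", 0),
    ("start", 0),
    ("facet", "true"),
    ("facet.mincount", 2),   # only keys occurring at least twice
    ("facet.limit", -1),     # no limit on the number of facet values
    ("facet.field", "oclc_num"),
    ("facet.field", "isbn"),
]
url = "http://localhost:8080/solr/biblio/select/?" + urlencode(params)
print(url)

# Duplication rate from the approximate figures in the text.
duplicated_keys = 1100 + 300
total_records = 93000
rate = 100 * duplicated_keys / total_records
print(f"{rate:.1f}%")  # 1.5%
```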

In an ideal world, there would be no duplicate records and/or duplicates would be merged into a single record. Unfortunately, the definition of “duplicate” is ambiguous, and a process for eliminating duplicates has not been implemented. To a Walt Whitman scholar, the difference between various editions of Leaves of Grass is definitely significant. Thus, sometimes the differences between editions are very important. Other times and for other people, they are not so important. In an ideal world, there would be no duplicates and a single record would warrant a de-duplication process, but the expense of de-duplicating that single record may be very high, especially if there is no de-duplication process in place. How many records, or what percentage of records, warrant a de-duplication process, especially considering the other things that have been set as priorities for the Portal? Honestly, I don’t know the answer.

Survey of Digitized Rare Catholica – Results

November 22nd, 2011
            
Marta Deyrup and Martha Loesch, catalogers at (CRRA institution) Seton Hall University, and Pat Lawton, digital projects librarian for the CRRA, have released the results of their Survey of Digitized Rare Catholica held by Catholic universities, colleges, seminaries and archives in the U.S. and Canada. You may view the Summary Report of Results and the results data.

Portal surgery

November 11th, 2011

I was recently told to delete thousands upon thousands of records from the “Catholic Portal”, and through the magic of Solr’s Web-based API and a full-featured HTTP client I was able to do this surgery with laser beam accuracy.

Specifically, I needed to delete all of the records in the Portal from the University of Notre Dame Archives because the Archives wanted to totally replace what finding aids were available. This meant deleting more than 100,000 records from the underlying index. After a bit of investigation, I learned that the following one-liner from the command line would do the trick:

curl 'http://localhost:8080/solr/biblio/update?commit=true' -H "Content-Type: text/xml" --data-binary '<delete><query>id:unaead_*</query></delete>'

In short, curl is a command-line HTTP client. It is being told to first connect to the local host on port 8080. It is then told to find all the records matching the query “id:unaead_*” and delete them from the index named biblio. Once that is done, the underlying index is expected to commit the changes. Deleting these records took about ten minutes. I was then able to use my previously created scripts to harvest, validate, transform, and index the Archives’ content painlessly.
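The same request can be assembled from a script. The sketch below mirrors the curl one-liner using only Python's standard library. It builds the request without sending it, so it is safe to run. The function name build_delete_request is my own invention.

```python
from urllib.request import Request
from xml.sax.saxutils import escape


def build_delete_request(update_url, query):
    """Build (but do not send) a Solr delete-by-query request."""
    # Escape the query so characters such as & and < stay valid XML.
    body = f"<delete><query>{escape(query)}</query></delete>".encode("utf-8")
    return Request(
        update_url,
        data=body,
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


request = build_delete_request(
    "http://localhost:8080/solr/biblio/update?commit=true",
    "id:unaead_*",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# Actually sending it would delete records, so that step is left commented out:
# from urllib.request import urlopen
# urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the query before anything irreversible happens.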

It is a pleasure when things work in the way they were designed! Now if I could only get my local indexing process to work faster.

VuStuff II: A Travelogue

November 1st, 2011

On Wednesday, October 12, 2011 I had the opportunity to attend and present at the second annual VuStuff meeting held at Falvey Library, Villanova University (Philadelphia). This posting documents my experience there, but in a nutshell, this small and intimate meeting provided a venue for interesting discussion on the topic of modern librarianship.

[Photos: Liberty Bell; cheese steak sandwich]

Joe Lucia (Villanova University) opened the meeting and set the stage by recommending a book called The Googlization of Everything. It advocates the creation of an open knowledge commons similar to the ones at the root of the fledgling Digital Public Library of America. To paraphrase his remarks, “Everything we do in our shop here embraces the open knowledge commons concept… Libraries are not just purveyors of content, but also creators of content: The New Resource Sharing. We [librarians] can become agents of information creation.”

The first presentation was given by Amy Baker Williams (University of Pittsburgh), and she described her process for conserving the maps of local coal mines. In the Pittsburgh (Pennsylvania) area there are many coal mines dating back as far as 1750. Some of the oldest maps of the mines date from 1850. A few years ago some miners were trapped in a mine, and if maps of the mines had been easily accessible, then rescue efforts would have been simplified. Since then concerted efforts have been made to preserve, digitize, and make accessible as many of these coal mining maps as possible in order to prevent similar accidents from happening in the future. I found the process used to flatten the maps to be the most interesting. Basically they are re-hydrated and unrolled. Moving the maps from the conservation lab to the scanning location was also interesting because, ironically, the maps are rolled up again for transportation as well as long-term storage. For more detail, see the website.

My presentation was next, and I shared with the audience how the Catholic Research Resources Alliance (CRRA) is using VuFind to implement the “Catholic Portal”. I first described the mission and history of the CRRA. I then outlined the Portal’s technical architecture as well as the process I used to index EAD files. Finally, I described how text mining functions have been integrated into the Portal’s interface emphasizing the possibilities for libraries in general.

[Photos: Falvey Library; mural]

During lunch we broke up into groups, and I sat with the folks interested in the digital humanities. For the most part we went around the table sharing common war stories. Most of our initiatives were fledgling, but there was plenty of enthusiasm.

After lunch a sort of “unconference” session was facilitated by David Upsal (Villanova University). The discussion topic that made itself apparent was the challenge of the profession to serve both traditional librarianship as well as librarianship in the current environment. If my memory serves me correctly, some of the suggested solutions included more resources (people and money), permission to “play” with new technology, a redefinition of library purpose, and greater collaboration between different types of libraries (public, academic, etc.).

The next presentation was given by Eric Zino (LYRASIS) who described how LYRASIS has been working with the Sloan Foundation and the Internet Archive to facilitate the digitization of 20,000,000 pages of library content. Approximately 160 libraries have been participating in the project with LYRASIS. Subsidized by the Foundation, participants package up their content and ship it to the Internet Archive. The content gets digitized, returned to the owning library, and the digital versions are made accessible at the Archive. From my perspective, this is exactly how any other library works with the Archive, except in this case LYRASIS does a bit of hand-holding during the process. Not all media is digitized by the Archive though. Some things, such as microfilm, are scanned by a different vendor, Creekside Digital.

The last presentation of the day was given by Bob Behary (Duquesne University), and he shared with the audience how Duquesne is digitizing a newspaper called the Pittsburgh Catholic. The project was initiated by a Catholic order called the Spiritans (the founding order of Duquesne University) with evangelism at its root. At first digitized versions of the newspaper were put on CDs and distributed. This has evolved over time, and now the content is housed in a ContentDM system. The collection has proven useful in a number of ways, including: local & regional church histories, literary allusions (such as Emily Dickinson), and United States history. Behary listed a number of key considerations for any digitization effort: 1) get administrative support, 2) make sure the project fits within the mission of the institution, 3) make sure to use sustainable technology, and 4) ensure knowledgeable research advocates are a part of the process.

[Photos: Vuee Award; Art Museum staircase]

I believe the meeting was attended by fifty to seventy-five people. Most were from the immediate area, and it offered an easy opportunity for professional development. Kudos to the folks at Villanova for hosting the event. Just before the meeting concluded I was awarded the second annual “Vuee” for best presentation. It is a small shoebox-sized container in the shape of a book. I was very flattered. “Thank you very much!”

Indexing EAD files in the “Catholic Portal” with VuFind

October 25th, 2011

This posting describes how EAD files are indexed in the “Catholic Portal” with VuFind.

VuFind is a “next-generation library catalog” or “discovery system” application. Its primary purpose is to index bibliographic metadata and provide a reader-friendly interface to the result. The heart of this process is a Solr index made up of many bibliographic-like fields. These fields are the usual suspects including a host of variants on author, title, institution, building, collection, language, format, physical description, publisher, published date, edition, description (note), contents, URL, call number, ISSN, ISBN, OCLC number, series, topic, genre, geographic, era, illustration, full text, and record type. In order for EAD files to be searchable in the Portal, they need to have their metadata extracted, the metadata needs to be mapped to Solr fields, and the metadata needs to be added to the index. The balance of this posting describes this in more detail.

Pre-processing

Before any indexing can take place, bits of pre-processing are applied against the EAD files. In a nutshell, this pre-processing (and the Perl scripts doing the work) includes:

  1. harvesting the EAD files from a remote HTTP server and caching them locally (ead-harvest.pl) – Done so the balance of the work can be done.
  2. validating the EAD files against the DTD and/or schema (ead-validate.pl) – Done because we don’t want to practice GIGO (Garbage In, Garbage Out).
  3. adding unique identifiers to each did-level element of the EAD files (ead-transform.pl) – The Solr indexer requires unique identifiers for each indexed item. This process provides the identifiers as well as makes it easy to hyperlink directly to a place in the EAD through the use of HTML anchors.
  4. transforming the EAD files into HTML and making the results Web accessible (ead-transform.pl) – Done because links to remote versions of the EAD files break, and humans do not read XML very well.
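Step 3 above can be sketched as follows. This is an illustration, not ead-transform.pl: it assumes un-namespaced EAD, and the identifier scheme (a prefix plus a zero-padded counter) and the xyz prefix are hypothetical.

```python
import xml.etree.ElementTree as ET


def add_unit_ids(root, prefix):
    """Ensure every did has a unitid carrying an id attribute; return the did count."""
    counter = 0
    # Materialise the list first so the tree can be modified safely while looping.
    for did in list(root.iter("did")):
        counter += 1
        unitid = did.find("unitid")
        if unitid is None:
            unitid = ET.SubElement(did, "unitid")
        if unitid.get("id") is None:
            unitid.set("id", f"{prefix}_{counter:05d}")
    return counter


sample = ET.fromstring(
    "<ead><archdesc>"
    "<did><unittitle>A</unittitle></did>"
    "<dsc><c01><did><unitid>Box 1</unitid><unittitle>B</unittitle></did></c01></dsc>"
    "</archdesc></ead>"
)
add_unit_ids(sample, "xyz")
print([unitid.get("id") for unitid in sample.iter("unitid")])
# ['xyz_00001', 'xyz_00002']
```

Existing id attributes are left alone, so identifiers already assigned by the contributing institution survive the process.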

Indexing

The bulk of the indexing process centers around the acquisition of metadata, and it is completely handled by a Perl script named ead-index.pl:

  1. The process begins by looking up the name of the institution and the name of the library from where the EAD file was created. These values are located in a rudimentary tab-delimited database.
  2. Next, the value for record type is denoted. It is always “EAD”.
  3. Third, a value for format is denoted. It is always “Archival material”.
  4. Next, the language of the material is extracted from the /ead/archdesc/did/langmaterial/language element. If no language is specified, then language is denoted as “Unknown”.
  5. Each did-level element from the EAD file is then examined pulling out its unique identifier (the id attribute of unitid element created in Step #3 of pre-processing), title (the unittitle element), and date (the unitdate element). The title metadata is a bit special since it is really a concatenation of all the parent title values of the given did element. This is done because each item in an EAD file is a part of the entire collection, and this enhanced title is intended to provide context.
  6. At this point the metadata for each did-level element has been extracted and is mapped to a select number of Solr fields, namely:
    • id -> unique identifier
    • title -> title
    • title_auth -> title
    • title_full -> title
    • title_fullStr -> title
    • title_full_unstemmed -> title
    • title_short -> title
    • title_sort -> title
    • publishDate -> date
    • format -> always “Archival material”, from Step #3
    • institution -> the name of the library’s hosting institution, from Step #1
    • building -> the name of the library, from Step #1
    • fullrecord -> An XML snippet containing the unique identifier, title, date, as well as two URLs pointing to HTML versions (local and remote) of the EAD file
    • recordtype -> always “EAD”, from Step #2
    • language -> language, from Step #4
  7. Finally, the metadata is added to VuFind’s underlying Solr index.
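The mapping in Step 6 amounts to building one dictionary per did-level element. The following sketch shows the idea. The function name and the sample values are hypothetical, and the actual script, ead-index.pl, is written in Perl and may organise this differently.

```python
# Every title-like Solr field receives the same concatenated title.
TITLE_FIELDS = (
    "title", "title_auth", "title_full", "title_fullStr",
    "title_full_unstemmed", "title_short", "title_sort",
)


def solr_document(identifier, title, date, institution, building,
                  language=None, fullrecord=""):
    """Build the field/value mapping for one did-level element."""
    document = {field: title for field in TITLE_FIELDS}
    document.update({
        "id": identifier,
        "publishDate": date,
        "format": "Archival material",      # constant for EAD content
        "institution": institution,
        "building": building,
        "fullrecord": fullrecord,
        "recordtype": "EAD",                # constant for EAD content
        "language": language or "Unknown",  # fallback when none is specified
    })
    return document


# Hypothetical values for a single item.
doc = solr_document(
    identifier="xyz_00002",
    title="Parish Records / Correspondence",
    date="1850-1860",
    institution="Example University",
    building="Example Archives",
)
print(doc["title_sort"], "|", doc["language"])
# Parish Records / Correspondence | Unknown
```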

Discussion

The indexing process is far from perfect. For example, in the current process, the entire head element of the EAD file is ignored. While it contains very rich metadata, such as controlled vocabulary terms and abstracts, these values describe the collection as a whole and do not necessarily apply to each individual did-level element.

Second, creating EAD files is laborious in the first place. There are not enough resources in most archival departments to describe did-level elements with much more detail than title and date. It would be nice to have a narrative summary describing each did-level element, a more specific format, some key words or controlled vocabulary, a consistently formatted date, etc. But again, creating such metadata for each did-level element is expensive. Consequently, indexed items are not described as robustly as possible.

Third, while VuFind’s implementation of Solr is bibliographic in nature, it is heavily weighted towards bibliographic metadata describing books. OCLC number. Call number. ISBN & ISSN. Edition. Etc. There are no fields for EAD-specific things such as postal addresses, provenance, or biographies.

Again, the process is not perfect, but it does enable the Catholic Research Resources Alliance to amalgamate the metadata of its member institutions and provide a searchable index to the result. Suggestions for improvement are welcome.

“Advancing Catholic Scholarship” Symposium at Duquesne Nov. 9-10

October 12th, 2011

Colleagues,

The registration deadline for this CRRA/Duquesne sponsored event is this Friday, October 15, 2011.  We are pleased that many of you have already registered for the event and if you have thought about registering, please do so now.  There is no fee to register.

The event features Catholic scholars, archivists, and librarians gathering together to consider the state of Catholic scholarship and how we can act together to advance and enhance freely available global access and discovery of important Catholic resources. The event will take place at Duquesne University (Pittsburgh) on Nov. 9-10.

We encourage librarians, scholars, and archivists interested in learning more about opportunities to make scholarly resources accessible to join in and meet new friends and colleagues.

A full roster of events and registration information is available at http://bit.ly/Duquesne_Symposium .   The registration deadline is this Friday, October 15, 2011.

We hope that you will join us in what promises to be a stimulating and productive conversation about Catholic scholarly research and the ways in which librarians and archivists support this research.

On behalf of Duquesne University and the Catholic Research Resources Alliance (CRRA),
• Jennifer Younger, chair, Board of Directors at younger.1@nd.edu
• Laverna Saunders, University Librarian, Gumberg Library, Duquesne University at lsaunders@duq.edu
• Pat Lawton, CRRA Digital Projects Librarian at plawton@nd.edu