Internet Archive content, VUFind (Solr), and text mining

The posting outlines how I have: 1) mirrored metadata and full text content from the Internet Archive, 2) made the mirrored content accessible through VUFind, and 3) implemented a rudimentary text mining interface against the mirror.

Background

The “Catholic Portal” is intended to be a research tool centered around “rare, unique, and uncommon” materials of a Catholic nature. Many of these sorts of things are older as opposed to newer, and therefore, many of these things are out of copyright. Projects such as Google Books and the Open Content Alliance specialize in the mass digitization of out of copyright materials. By extension we can hope some of the things apropos to the Portal have been digitized by one or more of these projects.

Continue reading “Internet Archive content, VUFind (Solr), and text mining”

Digital Access Committee (DAC) Meeting

Today we had a CRRA Digital Access Committee (DAC) meeting via the telephone. Attendees included:

  • Ann Hanlon
  • Demian Katz
  • Eric Frierson
  • Eric Morgan
  • Kevin Cawley
  • Pat Lawton
  • Susan Leister
  • Thomas Leonhardt

I did a bit of “Portal” show & tell demonstrating the work done to date on indexing EAD files. (See the previous blog posting.) We then discussed ways the indexing/display could be improved. Suggestions included:

  • putting the words “Archival material” into the format field of the Solr index thus allowing better faceting
  • reading the value of langmaterials and using it as the value for Solr’s language fields, again allowing for better faceting
  • reading all of the fields associated with a given container-level element and putting them into Solr’s allfields field to improve indexing
  • extracting the last value of our current “title”, using it as our title, and using the remaining values as some sort of supplemental description or alternatively, simply reversing the “title” string

We then brainstormed ways to resolve character encoding issues, the feasibility of making our metadata available via Web servers, and the status of the metadata guidelines.

We felt we had discussed it all, so the meeting was over.

Indexing MARC and EAD in VUFind with Solr for the CRRA

This posting outlines how I am currently indexing MARC and EAD files in VUFind with Solr for the CRRA. (Boy, there are a lot of acronyms in that sentence!)

Background

The Catholic Research Resources Alliance (CRRA) is a member-driven organization with the purpose of making available “rare, unique, and uncommon” research materials for Catholic scholarship. Presently the membership is primarily made up of libraries and archives who pool together their metadata records, have them indexed, and provide access to the index. My responsibility is to build and maintain the technical infrastructure supporting this endeavor.

Continue reading “Indexing MARC and EAD in VUFind with Solr for the CRRA”

Very satisfying!

I have made significant progress in the process of harvesting EAD files and preparing them for ingestion into the “Catholic Portal”. This posting outlines the successes.

Assuming a Catholic Research Resources Alliance members place their EAD files in a HTTP-accessible directory, and those files have a .xml extension, then the following Perl scripts enable me to harvest and prepare them for indexing:

  • harvest-ead.pl – reads remote HTTP-accessible directories and copies all of the .xml files found there to a local cache
  • validate.pl – makes sure the cached XML files are well-formed and conform to the EAD DTD, and if not, then move the files to a different directory
  • transform.pl – reads the validated XML files, adds id attributes to all unitid elements through the use of a stylesheet (addunitid.xsl), transforms the resulting XML into HTML using another stylesheet (ead2html.xsl), and saves the result to an HTTP-accessible directory

What was really cool and a huge time-saver was the use of ead2html.xsl. Originally named AAAv2002-HTML.xsl, found on a page called User Contributed Stylesheets, and submitted by Stephanie Ashley, this stylesheet took my id attributes and automatically made named anchors for me. Boy, did I get lucky. “Thank you, Stephanie!”

My next step is to revisit my indexing routines.

EAD @ Marquette 4 CRRA

This is the briefest of travelogues reporting on a meeting about EAD files at Marquette University for the Catholic Research Resources Alliance on September 20, 2010.

marquette sights

A few members of the Alliance were previously awarded a CLIR grant to catalog previously uncataloged special collections items. These members are now doing the work using EAD (Encoded Archive Description) with the intent of sharing the resulting metadata with the “Catholic Portal”. The purposes of the meeting were to build relationships between these particular Alliance members and to discuss progress on the grant. In attendance where people from St. Catherine University (Deborah Kloiber and Emily Asch), Marquette University (Matt Blessing, Ann Hanlon, Bill Fliss, and Jean Zanoni), and the University of Notre Dame (Pat Lawton, Kevin Cawley, and Eric Lease Morgan).

Continue reading “EAD @ Marquette 4 CRRA”

index-ead.pl

Today I indexed some of the metadata I extracted yesterday using a script called index-ead.pl. Of all the scripts I’ve written so far, this one is the most straight-forward. Read locally-developed XML file. Extract the unique identifier, title, and date. Associate each with VUFind/Solr fields. Commit.

You can (temporarily) see the fruits of these labors because all of the records have been associated with the Eric Lease Morgan Foo Bar Library. The result is a list of container-level records with very little additional information.

By the way, as of today I am running a version of VUFind as retrieved from the development trunk, specifically, revision 3029. When upgrading from revision to revision, it is important to retain one’s config.ini file and reindex. The process is not painful, if done infrequently. As time goes on I will also need to retain locally developed hacks, such as the ones I need to write below.

The next steps are to write the MARC record driver so it does not attempt to do automatic look-ups for call numbers, but rather extracts such information from of the local index. A second next step is to write an EAD record driver to accomodate the special cases of… EAD records.