Archive for October, 2010

Internet Archive content, VUFind (Solr), and text mining

Tuesday, October 26th, 2010

This posting outlines how I have: 1) mirrored metadata and full text content from the Internet Archive, 2) made the mirrored content accessible through VUFind, and 3) implemented a rudimentary text mining interface against the mirror.

Background

The “Catholic Portal” is intended to be a research tool centered around “rare, unique, and uncommon” materials of a Catholic nature. Many of these sorts of things are older as opposed to newer, and therefore, many of these things are out of copyright. Projects such as Google Books and the Open Content Alliance specialize in the mass digitization of out of copyright materials. By extension we can hope some of the things apropos to the Portal have been digitized by one or more of these projects.

Very recently St. Michael’s College in the University of Toronto has become a member of the Catholic Research Resources Alliance, and consequently, they desire to contribute to the Portal. As it just so happens, the University of Toronto has been a big proponent of mass digitization. They have been working with the Open Content Alliance for quite a while. Much of their content, including content from St. Michael’s, has been digitized. Complete with MARC records, PDF files, and plain text, these digital artifacts are freely available for downloading. Moreover, the availability of full text content opens up the doors to all sorts of text mining and digital humanities computing techniques in library “discovery systems”. Collocations. Word clouds. Graphing and mapping. Concordancing. Etc. As an example of one such discovery system, the Portal not only provides access to the content, but it can also make the content useful.

With input from Dave Hagelaar, Pat Lawton, and Remi Pulwer I implemented all of the things above, to some degree. The balance of this posting describes how.

The Process

Dave Hagelaar from St. Michael’s College sent me a set of around 600 Internet Archive unique identifiers from their collection representing “rare, unique, and uncommon” materials. Based on previous work, I was able to harvest the metadata, mirror the content, and integrate the whole into our VUFind interface. The process included the following steps:

  1. Convert identifiers – Each of the Internet Archive identifiers (keys) represents a Web page complete with metadata and links to digital content. The identifiers look something like this: delancienneetdel00rich. Given this information, sets of URLs can be constructed pointing to locations at the Archive. Creating a set of URLs based on the list of keys was done with a trivial Perl script called keys2urls.pl. The resulting URLs look like this:
  2. Mirror content – The next step was to copy the remote data locally — mirror it. This was done using the venerable wget program. Essentially, wget is called with a very long set of parameters as well as the output from Step #1. The result is a local cache of MARC, PDF, and plain text files. Since these files were saved in their own directory on an HTTP file system, each file has its own URL. To make life easier, the running of wget with all of its parameters was implemented as a simple shell script — mirror.sh
  3. Enhance MARC records – Given the additional locations of the mirrored content, the MARC records harvested from the Internet Archive were not complete. They did not include URLs pointing to the Internet Archive, nor did they include the URLs pointing to the local cache. Consequently the next step was to enhance the MARC records. This was done with a second Perl script called updatemarc.pl, but the script does more. Since we hoped to provide text mining services against the full text, a third URL needed to be included in the MARC pointing to the text mining interface. Finally, since the text mining application needs a bit of metadata itself, a rudimentary database listing the full text items is created along the way. This entire subprocess was complicated by the fact that not all of the harvested MARC records were valid. Because of character encoding issues, some of them were not readable by my MARC record parser (MARC::Batch). Some of the records were structurally incorrect: invalid leaders and misplaced record/field/subfield delimiters. Finally, some of the records apparently included invalid values for various indicators. To make sure the database was as clean as possible, any record generating any sort of error was excluded from the final processing. This left approximately 400 of the original 600 records.
  4. Index MARC records – The next step was to ingest the MARC records into VUFind’s underlying Solr index. This was done with a Perl script called marc-index.pl and described in a previous posting. With the completion of this step, the content provided by St. Michael’s College became available in the Portal. Search or browse the Portal for records. Find items from St. Michael’s. Click on a link to get the content from the Internet Archive. Click on another link to retrieve it from the local cache. For example, see the record for Letters of an Irish Catholic layman.
  5. Support text mining – The final step in the process deserves a blog posting in its own right, and thus only a summary will be provided here. At its foundation, text mining centers on the process of counting ngrams whether they be single letters, single syllables, multiple syllables, individual words, multi-word phrases, sentences, etc. Once these things are counted they can be measured. Once they are measured, patterns can be sought, and if patterns are found, then overarching descriptions can be articulated resulting in the creation of new knowledge or an increase in understanding. When coupled with concordances, ngrams can be placed within the context of the larger work to learn how they were used. Using two Perl modules (Lingua::EN::Ngram and Lingua::Concordance) a simple Web-based interface was written allowing the scholar to list the most frequent ngrams in a text, map their relative locations in it, and read snippets of text surrounding them. Using this technique it is possible to quickly and easily get an overview of the content of a document. The text mining application I created is initialized with an Internet Archive identifier. The application reads the identifier, looks up the location of the locally cached plain text file, reads it into memory, and allows the researcher to do “distant reading” against it. Unfortunately Lingua::Concordance only works sporadically against non-English files, but you can still see how the system works by using the concordance against Letters of an Irish Catholic layman.
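The counting and concordancing described in Step #5 can be illustrated with a minimal sketch. It is written in Python, not the Perl used for the Portal, and it does not reproduce Lingua::EN::Ngram or Lingua::Concordance. The function names (tokenize, count_ngrams, concordance) and the sample sentence are hypothetical and exist only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_ngrams(text, n=2):
    """Count every run of n consecutive words in the text."""
    words = tokenize(text)
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def concordance(text, query, radius=30):
    """Return snippets showing each occurrence of query with surrounding context."""
    flat = " ".join(text.split())  # collapse line breaks and runs of spaces
    snippets = []
    for match in re.finditer(re.escape(query), flat, flags=re.IGNORECASE):
        start = max(0, match.start() - radius)
        end = min(len(flat), match.end() + radius)
        snippets.append(flat[start:end])
    return snippets


# Hypothetical sample text; a real run would read a locally cached plain text file.
sample = "The letters were long, and the letters were many."
for gram, count in count_ngrams(sample, 2).most_common(2):
    print(count, gram)
for snippet in concordance(sample, "letters", radius=12):
    print("...", snippet, "...")
```

The tokenizer only knows about the letters a through z, so accented characters split words apart. A real interface would need a tokenizer that suits the language of the text.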

Summary

The process outlined above describes how full text content can be harvested from the ‘Net and integrated into the VUFind “discovery system”. The key to doing this easily was the existence of metadata (MARC records) describing the harvested items. Without this metadata the process would have been too laborious. The posting also outlines how the harvested full text can be put to greater use through a simple text mining interface.

Software is never done. If it were, then it would be called hardware. Consequently, there are many ways the process can be improved. Examples include figuring out ways to repair broken MARC records, and updating Lingua::Concordance to work correctly with foreign language materials. Maybe I should call this job security.

Names & addresses

Thursday, October 14th, 2010

This posting outlines how the names & addresses of the “Catholic Portal” are made available. The purpose of this posting is mostly documentation. Documentation for myself, since I always forget. And documentation so somebody else can do the work after I win the lottery and move to the beach to drink cocktails with umbrellas in them.

Here goes:

  1. Extract data – Open the spreadsheet. Activate the ACCU tab. Copy all of the data sans the “cool” data entry macros. Create a new spreadsheet. Paste all of the previously copied data into the new spreadsheet. Save the new spreadsheet for future reference with the name catholic_libraries.xls.
  2. Extract more data – Repeat Step #1 for the tab labeled Tab 2, but save the newly created spreadsheet with the name atla.xls.
  3. Clean – Open catholic_libraries.xls and delete columns so the only ones remaining are: last name, first name, school, address, city, state, zip code, and email address. Make sure the remaining columns are in the order listed above. During this process the data may need further cleaning. For example, curly quotes need to be straightened. Carriage returns inside cells need to be removed. Make sure city and state values contain only… city and state values. No countries.
  4. Sort – Sort catholic_libraries.xls in ascending order by school.
  5. Save – Save the cleaned and sorted data as a tab-delimited text file with the name catholic_libraries.db. Make sure the resulting text file is Unix-based and not DOS- or Macintosh-based. Additionally, Excel often tries to do you a favor by surrounding fields containing commas with quotes. Remove the quotes.
  6. Go to Step #3 – Repeat the process for the file named atla.xls, but include only the last name, first name, school, city, state, and email address in the saved data, and call the result atla.db.
  7. Mount – Mount the saved database files (catholic_libraries.db and atla.db) by saving them to the Portal. They are expected to live in Y:\data\vufind\web\etc.
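Some of the cleaning in Steps #3 through #5 could be scripted instead of done by hand. What follows is only a sketch under stated assumptions, written in Python: the input is assumed to be the tab-delimited text exported from Excel, the school is assumed to be the third column (as in the column order listed in Step #3), and the function names are hypothetical. It straightens curly quotes, removes line breaks inside cells, drops the quotes Excel adds around fields, sorts by school, and writes Unix line endings.

```python
import csv
import io

# Curly quotes mapped to their straight equivalents.
STRAIGHTEN = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def clean_cell(value):
    """Straighten curly quotes and collapse line breaks and extra spaces in a cell."""
    value = value.translate(STRAIGHTEN)
    return " ".join(value.split())


def clean_rows(tsv_text, school_column=2):
    """Parse tab-delimited text, clean every cell, and sort the rows by school."""
    # The csv module strips the quotes Excel wraps around fields containing commas.
    reader = csv.reader(io.StringIO(tsv_text, newline=""), delimiter="\t")
    rows = [[clean_cell(cell) for cell in row] for row in reader if row]
    return sorted(rows, key=lambda row: row[school_column].lower())


def to_unix_tsv(rows):
    """Join rows with tabs and Unix newlines, without any quoting."""
    return "".join("\t".join(row) + "\n" for row in rows)
```

Calling to_unix_tsv(clean_rows(text)) on the exported text yields content that could be saved as catholic_libraries.db. A row with fewer than three columns would raise an error, which is arguably useful since it points at data needing attention.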

The result of this work should then be visible under the Directory tab.

This process is a bit tedious, but since the directory does not change very often, and now that I have documented the process, the next time the directory needs updating things will be easier. On the other hand, as the Portal grows, there will be a need for a real database, and it will be able to support additional functions, such as document delivery. Stay tuned.

Digital Access Committee (DAC) Meeting

Tuesday, October 12th, 2010

Today we had a CRRA Digital Access Committee (DAC) meeting via the telephone. Attendees included:

  • Ann Hanlon
  • Demian Katz
  • Eric Frierson
  • Eric Morgan
  • Kevin Cawley
  • Pat Lawton
  • Susan Leister
  • Thomas Leonhardt

I did a bit of “Portal” show & tell demonstrating the work done to date on indexing EAD files. (See the previous blog posting.) We then discussed ways the indexing/display could be improved. Suggestions included:

  • putting the words “Archival material” into the format field of the Solr index thus allowing better faceting
  • reading the value of langmaterials and using it as the value for Solr’s language fields, again allowing for better faceting
  • reading all of the fields associated with a given container-level element and putting them into Solr’s allfields field to improve indexing
  • extracting the last value of our current “title”, using it as our title, and using the remaining values as some sort of supplemental description or alternatively, simply reversing the “title” string

We then brainstormed ways to resolve character encoding issues, the feasibility of making our metadata available via Web servers, and the status of the metadata guidelines.

We felt we had discussed it all, so the meeting was over.

Indexing MARC and EAD in VUFind with Solr for the CRRA

Tuesday, October 12th, 2010

This posting outlines how I am currently indexing MARC and EAD files in VUFind with Solr for the CRRA. (Boy, there are a lot of acronyms in that sentence!)

Background

The Catholic Research Resources Alliance (CRRA) is a member-driven organization with the purpose of making available “rare, unique, and uncommon” research materials for Catholic scholarship. Presently the membership is primarily made up of libraries and archives who pool together their metadata records, have them indexed, and provide access to the index. My responsibility is to build and maintain the technical infrastructure supporting this endeavor.

A couple of years ago much of the CRRA metadata was manifested as MARC, and at that time VUFind was selected as the tool we would use to index, search, and display this content. About six months ago the Alliance realized the growing necessity of including EAD files as well. At the same time, the ability to accommodate non-MARC metadata was increasingly becoming a VUFind reality. New ground still had to be broken; processes needed to be implemented allowing VUFind (and the underlying Solr indexer) to understand how to work with materials which were not book-like.

The balance of this posting describes in greater detail how I am beginning to accommodate MARC as well as EAD metadata into VUFind’s interface with Solr.

Assumptions

The system runs on a number of assumptions. First, it is assumed it is the members’ responsibility to create and maintain their metadata. Second, it is my responsibility to index it and make it available for display. Moreover, it is assumed each metadata record includes at least three values: 1) a unique identifier, 2) a human-readable description of an item, and 3) an address pointing to the location of the item. For MARC records, these things reside in the 001, 245, and 099 fields. For EAD files, they have been designated as the id attribute of unitid elements, the content of unittitle elements, and the url attribute of the eadid element, which in turn leads to the location of the item.

Additionally, it is assumed all metadata records, whether MARC or EAD, are available for harvesting from a Web server. In other words, each member who wants to have their MARC records available in the CRRA needs to export their records to a single file and make them accessible via a URL. Similarly, all EAD files which are intended to be indexed need to be in a single Web-accessible directory and the URL of the directory needs to be known. Making member metadata accessible via a Web server has three benefits: 1) it facilitates automation, 2) it distributes the responsibility of archiving metadata across the membership, and 3) it enables the metadata to be harvested by other applications and used for other things. “Can you say ‘linked data?’”
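To make the harvesting assumption concrete, here is a minimal sketch in Python (not the Perl actually used) of how the .xml files in a Web-accessible directory might be discovered. It assumes the Web server returns an HTML listing of the directory with one link per file, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class XmlLinkParser(HTMLParser):
    """Collect the href of every anchor that points at a .xml file."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".xml"):
            self.hrefs.append(href)


def xml_urls(directory_url, listing_html):
    """Return absolute URLs for the .xml files named in a directory listing page."""
    parser = XmlLinkParser()
    parser.feed(listing_html)
    return [urljoin(directory_url, href) for href in parser.hrefs]
```

Each resulting URL could then be copied to a local cache, for example with urllib.request.urlretrieve.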

Files and Perl scripts

Given these assumptions, the following sets of files and Perl scripts are used to do the work. The first set is core to both of the other two:

  • libraries.db – A “database” of CRRA participants consisting of their names, libraries, and URLs where their metadata records can be found. This file is used by just about every other script in the system.
  • subroutines.pl – A tiny library of Perl subroutines, mostly to read the contents of libraries.db.

This second set is used to index MARC metadata:

  • marc-harvest.pl – Copies (mirrors) remote MARC files locally
  • marc-add-code.pl – Validates and updates the values of MARC 001 fields making sure they exist and are unique
  • marc-index.pl – Slurps up a Solr marc.properties template (template.txt), makes the appropriate substitutions, and indexes the MARC records associated with a given library
  • marc-build.sh – A shell script used to run all of the MARC-based scripts. One ring to rule them all.

The third is used to index EAD files:

  • ead-harvest.pl – Copies (mirrors) remote XML files locally
  • ead-validate.pl – Makes sure the mirrored XML files are well-formed, conform to the EAD DTD, and include an eadid url attribute (done with a stupid stylesheet called geturl.xsl)
  • ead-transform.pl – Makes sure each EAD container-level element includes a unitid with a unique id attribute, saves the result to a local cache, and transforms these same files into HTML. The first process is done with a stylesheet called addunitid.xsl. The second process is done with another stylesheet called ead2html.xsl.
  • ead-index.pl – Indexes all the cached/transformed EAD files by parsing out container-level elements, creating an XML stream of records of my own design, parsing the result, and passing each record on to Solr. The heart of this script is a fourth stylesheet — ead2solr.xsl
  • ead-build.sh – A shell script used to run all of the EAD-based scripts. Another ring to rule them all.
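Two of the checks attributed to ead-validate.pl above, well-formedness and the presence of an eadid url attribute, can be sketched with nothing but a standard XML parser. The sketch below is in Python, not the Perl and XSLT actually used, and the function names are hypothetical. It does not validate against the EAD DTD, since that requires a validating parser.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def check_ead(xml_text):
    """Return a list of problems found in one EAD document (empty means it passed)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as error:
        return [f"not well-formed: {error}"]
    problems = []
    eadids = [
        element for element in root.iter()
        if isinstance(element.tag, str) and local_name(element.tag) == "eadid"
    ]
    if not eadids:
        problems.append("no eadid element")
    elif not eadids[0].get("url"):
        problems.append("eadid has no url attribute")
    return problems
```

A file returning a non-empty list could be moved out of the cache and reported back to the member who maintains it.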

The “secret” to indexing EAD files is really no secret. I simply followed Demian Katz’s instructions. In a nutshell, to index non-MARC content the developer needs to:

  • Parse the given metadata into records. I do this with ead2solr.xsl.
  • Map each of the record’s values to as many of the underlying Solr fields as possible. Presently I only have titles and I do this through ead2solr.xsl as well.
  • Create an XML snippet representing each record and map it to the Solr fullrecord field, described below.
  • Denote a record type. I call mine ead.
  • Save the whole thing to Solr, done with ead-index.pl.

Currently, my XML snippet (Item #3) looks like this:

  <record>
    <id>unaead_id2635150</id>
    <title>Catholic Church. Archdiocese of Detroit (Mich.)
      Collection -- Catholic Church. Archdiocese of
      Detroit (Mich.): Manuscripts -- Letters -- Bp.
      Baraga to his sister Amalia
    </title>
    <date>1836/1203</date>
    <url description='View remote, canonical version of EAD'>
      http://archives.nd.edu/findaids/ead/xml/det.xml
    </url>
    <url description='View local version of EAD file'>
      http://zoia.library.nd.edu/sandbox/crra-data/ead/una-det.html#id2635150
    </url>
  </record>
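The mapping from a record snippet like this one to Solr fields can be sketched as follows. This is Python, not the Perl and XSLT actually used, and it stops short of talking to Solr. The field names id, title, and fullrecord follow the text above. The name recordtype for the field holding the record type is an assumption, as are the function names.

```python
import xml.etree.ElementTree as ET


def record_to_solr_doc(record_xml, record_type="ead"):
    """Map one <record> snippet to a dictionary of Solr field names and values."""
    record = ET.fromstring(record_xml)
    # Collapse the line breaks and indentation inside the title.
    title = " ".join((record.findtext("title") or "").split())
    return {
        "id": (record.findtext("id") or "").strip(),
        "title": title,
        "recordtype": record_type,  # assumed field name for the record type
        "fullrecord": record_xml.strip(),
    }


def extract_urls(fullrecord):
    """Return (description, address) pairs for every url element in a stored record."""
    record = ET.fromstring(fullrecord)
    return [
        (url.get("description", ""), (url.text or "").strip())
        for url in record.iter("url")
    ]
```

The second function does in Python roughly what a record driver has to do at display time: read the fullrecord field back out of the index and pull the URLs from it.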

The VUFind application provides seamless access to the index through its search box, but a bit of work needs to be done to display search results. Specifically a “record driver” needs to be written to accommodate new record types (Item #4, above). This driver inherits methods from a parent driver, IndexRecord.php, and the developer needs to override some of the methods found there with methods considering the content of the fullrecord field. Presently, the only thing I have in my record driver (EadRecord.php) is a method to extract URLs. In the future I will need to include methods to extract names of CRRA members, names of their libraries, and additional descriptive metadata.

You can see the fruits of these efforts in the CRRA “sandbox” — something we are affectionately calling “The Green Interface”.

Issues

The whole process functions and could be run automatically from cron on a daily basis, but there is plenty of room for improvement. Issues include:

  • speed – The indexing process is slower than I’d like. I think throwing more hardware at the problem will make things faster.
  • invalid data and stale URLs – A small percentage of the MARC and EAD files do not include the required metadata values. No unique identifiers. Malformed MARC leaders. Non-validating EAD files and/or eadid url attributes pointing to broken locations. This is where metadata maintenance comes in.
  • character encoding – This is one of the bigger problems. Trying to figure out whether or not a MARC record has been exported as UTF-8 is difficult. Solr assumes UTF-8 and I don’t think it even knows about MARC-8. When MARC data is not encoded as UTF-8, search results look really ugly. Similarly, some of the EAD files, because of similar issues, really display poorly after they have been transformed, indexed, searched, and displayed.

None of these things are insurmountable. They will be addressed.

Next steps

My immediate next steps focus on richer search results. I need to extract additional information from the EAD files to supplement the content of my fullrecord field. After that I will explore the creation of “collection-level” records by indexing the headers of EAD files. These records will be fuller because they will have things like controlled vocabularies, scope notes/abstracts, and biographies from which to draw. Once the fullrecord fields are enhanced, I will need to go back to EadRecord.php and enhance its functionality. After that I will see about creating reports listing errors in metadata files. These reports will be designed to share with members making it easier for them to maintain their content.

All of that sounds like plenty to me. Wish me luck.

Very satisfying!

Wednesday, October 6th, 2010

I have made significant progress in the process of harvesting EAD files and preparing them for ingestion into the “Catholic Portal”. This posting outlines the successes.

Assuming Catholic Research Resources Alliance members place their EAD files in an HTTP-accessible directory, and those files have a .xml extension, then the following Perl scripts enable me to harvest and prepare them for indexing:

  • harvest-ead.pl – reads remote HTTP-accessible directories and copies all of the .xml files found there to a local cache
  • validate.pl – makes sure the cached XML files are well-formed and conform to the EAD DTD, and if not, then moves the files to a different directory
  • transform.pl – reads the validated XML files, adds id attributes to all unitid elements through the use of a stylesheet (addunitid.xsl), transforms the resulting XML into HTML using another stylesheet (ead2html.xsl), and saves the result to an HTTP-accessible directory
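The first half of what transform.pl does, adding id attributes to unitid elements, is done with a stylesheet (addunitid.xsl). As an illustration only, the same idea can be sketched in Python. The id prefix, the numbering scheme, and the function name are hypothetical, and the sketch assumes EAD files without XML namespaces.

```python
import xml.etree.ElementTree as ET


def add_unitid_ids(xml_text, prefix="id"):
    """Give every unitid element lacking an id attribute a generated, unique one."""
    root = ET.fromstring(xml_text)
    # Remember ids already present anywhere in the file so none are reused.
    used = {element.get("id") for element in root.iter() if element.get("id")}
    counter = 0
    for unitid in root.iter("unitid"):
        if unitid.get("id"):
            continue
        counter += 1
        while f"{prefix}{counter}" in used:
            counter += 1
        new_id = f"{prefix}{counter}"
        unitid.set("id", new_id)
        used.add(new_id)
    return ET.tostring(root, encoding="unicode")
```

Only the attribute is added. The text of each unitid element is left alone.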

What was really cool and a huge time-saver was the use of ead2html.xsl. Originally named AAAv2002-HTML.xsl, found on a page called User Contributed Stylesheets, and submitted by Stephanie Ashley, this stylesheet took my id attributes and automatically made named anchors for me. Boy, did I get lucky. “Thank you, Stephanie!”

My next step is to revisit my indexing routines.

EAD @ Marquette 4 CRRA

Sunday, October 3rd, 2010

This is the briefest of travelogues reporting on a meeting about EAD files at Marquette University for the Catholic Research Resources Alliance on September 20, 2010.


A few members of the Alliance were previously awarded a CLIR grant to catalog previously uncataloged special collections items. These members are now doing the work using EAD (Encoded Archival Description) with the intent of sharing the resulting metadata with the “Catholic Portal”. The purposes of the meeting were to build relationships between these particular Alliance members and to discuss progress on the grant. In attendance were people from St. Catherine University (Deborah Kloiber and Emily Asch), Marquette University (Matt Blessing, Ann Hanlon, Bill Fliss, and Jean Zanoni), and the University of Notre Dame (Pat Lawton, Kevin Cawley, and Eric Lease Morgan).

Of primary concern was the particular way people were using EAD and whether or not it would lend itself to indexing by the “Portal” software. Consequently, I spent a lot of the time describing the technical infrastructure of VUFind and how it interfaced with Solr, the underlying indexer/search engine. In short, the absolute need for unique identifiers, human-readable descriptions of items, and location codes was enumerated. The former two can be garnered from the unitid and unittitle elements of an EAD file’s did elements. The latter can be gotten from the url attribute of the eadid element. Everybody was confident their EAD files would contain these values.

We then went around the table doing a bit of show & tell against our EAD. The folks of St. Catherine’s were using the Archivists’ Toolkit to “catalog” their Ade Bethune collection. Marquette University was using a Microsoft Access database to “catalog” Dorothy Day content.

Timetables were then outlined. The whole CLIR project is expected to be finished by December of 2011. Participants in attendance thought their work would be done by the end of Spring 2011, and the remaining time would be spent on putting the content onto the “Portal” as well as doing various types of publicity (conference presentations, etc.).

The meeting was over around noon, and we all retired to the faculty club for lunch. (“Thank you, Marquette!”)

In retrospect, there may be two additional issues needing to be addressed. First, I originally planned to assign or replace unitid values with locally generated, “Catholic Portal” specific values, but I have since learned that unitid information is often times used as a sort of call number and therefore necessary for location. Replacing (removing) such values from the EAD files may make work down the line more difficult. Maybe I should be getting the unique values from an id attribute of the unitid element instead?

Second, as a group we may need to decide how to encode dates. Dates can be nested within unittitle elements as well as free-standing elements in the did. Just as importantly, they can take all sorts of forms. In order to make sorting and faceting feasible, the Alliance may need to figure out ways to standardize and normalize dates.
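As one illustration of what normalizing might mean, the sketch below (Python, hypothetical names) pulls the first plausible four-digit year out of a free-form date string so that it can be used for sorting or faceting. It is not something the Alliance has agreed upon. It assumes years between 1000 and 2099, and it throws away months, days, and ranges.

```python
import re

# A four-digit year from 1000 to 2099 that is not part of a longer run of digits.
YEAR = re.compile(r"(?<!\d)(1[0-9]{3}|20[0-9]{2})(?!\d)")


def first_year(date_text):
    """Return the first four-digit year found in the text, or None."""
    match = YEAR.search(date_text)
    return int(match.group(1)) if match else None


for value in ["1836/1203", "ca. 1920-1935", "Dec. 3, 1836", "undated"]:
    print(value, "->", first_year(value))
```

Anything for which the function returns None would need to be handled by hand or left out of date-based facets.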

index-ead.pl

Friday, October 1st, 2010

Today I indexed some of the metadata I extracted yesterday using a script called index-ead.pl. Of all the scripts I’ve written so far, this one is the most straight-forward. Read locally-developed XML file. Extract the unique identifier, title, and date. Associate each with VUFind/Solr fields. Commit.

You can (temporarily) see the fruits of these labors because all of the records have been associated with the Eric Lease Morgan Foo Bar Library. The result is a list of container-level records with very little additional information.

By the way, as of today I am running a version of VUFind as retrieved from the development trunk, specifically, revision 3029. When upgrading from revision to revision, it is important to retain one’s config.ini file and reindex. The process is not painful, if done infrequently. As time goes on I will also need to retain locally developed hacks, such as the ones I need to write below.

The next steps are to write the MARC record driver so it does not attempt to do automatic look-ups for call numbers, but rather extracts such information from the local index. A second next step is to write an EAD record driver to accommodate the special cases of… EAD records.