Internet Archive content, VUFind (Solr), and text mining

This posting outlines how I have 1) mirrored metadata and full text content from the Internet Archive, 2) made the mirrored content accessible through VUFind, and 3) implemented a rudimentary text mining interface against the mirror.


The “Catholic Portal” is intended to be a research tool centered around “rare, unique, and uncommon” materials of a Catholic nature. Many such materials are older rather than newer, and therefore many of them are out of copyright. Projects such as Google Books and the Open Content Alliance specialize in the mass digitization of out-of-copyright materials. By extension, we can hope some of the things apropos to the Portal have been digitized by one or more of these projects.

Very recently St. Michael’s College in the University of Toronto became a member of the Catholic Research Resources Alliance, and consequently, they desire to contribute to the Portal. As it just so happens, the University of Toronto has been a big proponent of mass digitization. They have been working with the Open Content Alliance for quite a while. Much of their content, including content from St. Michael’s, has been digitized. Complete with MARC records, PDF files, and plain text, these digital artifacts are freely available for downloading. Moreover, the availability of full text content opens the door to all sorts of text mining and digital humanities computing techniques in library “discovery systems”. Collocations. Word clouds. Graphing and mapping. Concordancing. Etc. As an example of one such discovery system, the Portal can not only provide access to the content, but it can also make the content useful.

With input from Dave Hagelaar, Pat Lawton, and Remi Pulwer I implemented all of the things above, to some degree. The balance of this posting describes how.

The Process

Dave Hagelaar from St. Michael’s College sent me a set of around 600 Internet Archive unique identifiers from their collection representing “rare, unique, and uncommon” materials. Based on previous work, I was able to harvest the metadata, mirror the content, and integrate the whole into our VUFind interface. The process included the following steps:

  1. Convert identifiers – Each of the Internet Archive identifiers (keys) represents a Web page complete with metadata and links to digital content. The identifiers look something like this: delancienneetdel00rich. Given this information, sets of URLs can be constructed pointing to locations at the Archive. Creating a set of URLs based on the list of keys was done with a trivial Perl script.
  2. Mirror content – The next step was to copy the remote data locally, that is, to mirror it. This was done using the venerable wget program. Essentially, wget is called with a very long set of parameters as well as the output from Step #1. The result is a local cache of MARC, PDF, and plain text files. Since these files were saved in their own directory on an HTTP file system, each file has its own URL. To make life easier, the running of wget with all of its parameters was implemented as a simple shell script.
  3. Enhance MARC records – Given the additional locations of the mirrored content, the MARC records harvested from the Internet Archive were not complete. They did not include URLs pointing to the Internet Archive, nor did they include the URLs pointing to the local cache. Consequently, the next step was to enhance the MARC records. This was done with a second Perl script, but the script does more. Since we hoped to provide text mining services against the full text, a third URL needed to be included in the MARC pointing to the text mining interface. Finally, since the text mining application needs a bit of metadata itself, a rudimentary database listing the full text items is created along the way. This entire subprocess was complicated by the fact that not all of the harvested MARC records were valid. Because of character encoding issues, some of them were not readable by my MARC record parser (MARC::Batch). Some of the records were structurally incorrect, with invalid leaders and misplaced record/field/subfield delimiters. Finally, some of the records apparently included invalid values for various indicators. To make sure the database was as clean as possible, any record generating any sort of error was excluded from the final processing. This left approximately 400 of the original 600 records.
  4. Index MARC records – The next step was to ingest the MARC records into VUFind’s underlying Solr index. This was done with a Perl script described in a previous posting. With the completion of this step, the content provided by St. Michael’s College became available in the Portal. Search or browse the Portal for records. Find items from St. Michael’s. Click on a link to get the content from the Internet Archive. Click on another link to retrieve it from the local cache. For example, see the record for Letters of an Irish Catholic layman.
  5. Support text mining – The final step in the process deserves a blog posting in its own right, and thus only a summary will be provided here. At its foundation, text mining centers on the process of counting ngrams, whether they be single letters, single syllables, multiple syllables, individual words, multi-word phrases, sentences, etc. Once these things are counted they can be measured. Once they are measured, patterns can be sought, and if patterns are found, then overarching descriptions can be articulated, resulting in the creation of new knowledge or an increase in understanding. When coupled with concordances, ngrams can be placed within the context of the larger work to learn how they were used. Using two Perl modules (Lingua::EN::Ngram and Lingua::Concordance), a simple Web-based interface was written allowing the scholar to list the most frequent ngrams in a text, map their relative locations in it, and read snippets of text surrounding them. Using this technique it is possible to quickly and easily get an overview of the content of a document. The text mining application I created is initialized with an Internet Archive identifier. The application reads the identifier, looks up the location of the locally cached plain text file, reads it into memory, and allows the researcher to do “distant reading” against it. Unfortunately, Lingua::Concordance only works sporadically against non-English files, but you can still see how the system works by using the concordance against Letters of an Irish Catholic layman.
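
To make Step #5 a bit more concrete, here is a minimal sketch of the three things the interface does: count ngrams, map their relative locations, and show snippets of surrounding text. It is written in Python using only the standard library, and it is not the code behind Lingua::EN::Ngram or Lingua::Concordance. The function names, the tokenizing rule, and the snippet width are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (an assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())


def top_ngrams(text, n=2, limit=10):
    """Return the `limit` most frequent n-word phrases as (phrase, count) pairs."""
    words = tokenize(text)
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams).most_common(limit)


def concordance(text, phrase, width=30):
    """Return snippets showing `width` characters on either side of each match."""
    snippets = []
    for match in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Collapse line breaks and runs of spaces so each snippet reads as one line.
        snippets.append(" ".join(text[start:end].split()))
    return snippets


def relative_positions(text, phrase):
    """Return each match's location as a fraction (0.0 to 1.0) of the way through the text."""
    if not text or not phrase:
        return []
    pattern = re.escape(phrase)
    return [m.start() / len(text) for m in re.finditer(pattern, text, flags=re.IGNORECASE)]
```

Given the plain text of a book read into a string, top_ngrams(text, n=2) lists candidate phrases, and feeding one of those phrases to concordance() and relative_positions() shows how and where it was used. Because the tokenizer only recognizes unaccented Latin letters, this sketch would stumble on non-English texts too.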


The process outlined above describes how full text content can be harvested from the ‘Net and integrated into the VUFind “discovery system”. The key to doing this easily was the existence of metadata (MARC records) describing the harvested items. Without this metadata the process would have been too laborious. The process also shows how the harvested full text can be put to greater use through a simple text mining interface.
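
The harvesting half of the process (Steps #1 and #2) boils down to turning a list of keys into a list of URLs. Here is a rough Python sketch of that idea; it is not the Perl script I used. The base URL and the file name suffixes are assumptions based on the way the Archive commonly lays out its download directories, and they may not match what the original script produced.

```python
# Assumed base URL for Internet Archive downloads (not taken from the original script).
ARCHIVE_ROOT = "https://archive.org/download"

# Assumed file naming for each kind of file; actual items may differ.
SUFFIXES = {
    "marc": "_meta.mrc",
    "pdf": ".pdf",
    "text": "_djvu.txt",
}


def build_urls(identifier):
    """Map one Internet Archive identifier to the URLs of its MARC, PDF, and text files."""
    identifier = identifier.strip()
    return {
        kind: f"{ARCHIVE_ROOT}/{identifier}/{identifier}{suffix}"
        for kind, suffix in SUFFIXES.items()
    }


def urls_for_keys(lines):
    """Yield every URL for every non-blank identifier in an iterable of lines."""
    for line in lines:
        key = line.strip()
        if key:
            yield from build_urls(key).values()


if __name__ == "__main__":
    # Print one URL per line for a sample key.
    for url in urls_for_keys(["delancienneetdel00rich"]):
        print(url)
```

Written to a file, one URL per line, output like this is one way to feed a mirroring tool; wget, for example, can read its list of URLs from a file with its --input-file option.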

Software is never done. If it were, then it would be called hardware. Consequently, there are many ways the process can be improved. Examples include figuring out ways to repair broken MARC records, and updating Lingua::Concordance to work correctly with foreign language materials. Maybe I should call this job security.
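
As a first step toward repairing broken MARC records, it helps to know exactly how a record is broken. The Python sketch below is not part of my scripts and does not use MARC::Batch. It checks only the basic structure of a binary MARC (ISO 2709) record: the 24-character leader, the declared record length, the directory of 12-character entries, and the terminators. The function names are hypothetical, and the checks say nothing about character encoding or indicator values.

```python
FIELD_TERMINATOR = b"\x1e"
RECORD_TERMINATOR = b"\x1d"
LEADER_LENGTH = 24
DIRECTORY_ENTRY_LENGTH = 12  # 3-digit tag + 4-digit field length + 5-digit start


def split_records(data):
    """Split a bytes blob of concatenated MARC records on the record terminator."""
    chunks = data.split(RECORD_TERMINATOR)
    return [chunk + RECORD_TERMINATOR for chunk in chunks if chunk.strip()]


def structural_problems(record):
    """Return a list of structural problems found; an empty list means none were found."""
    if len(record) < LEADER_LENGTH + 2:
        return ["record is shorter than a leader plus its terminators"]

    problems = []
    leader = record[:LEADER_LENGTH]

    # Leader positions 00-04 hold the total record length.
    if not leader[0:5].isdigit():
        problems.append("leader positions 00-04 (record length) are not digits")
    elif int(leader[0:5]) != len(record):
        problems.append("declared record length does not match actual length")

    # Leader positions 12-16 hold the base address of the data.
    if not leader[12:17].isdigit():
        problems.append("leader positions 12-16 (base address) are not digits")
    else:
        base = int(leader[12:17])
        directory = record[LEADER_LENGTH:base - 1]
        if base > len(record) or record[base - 1:base] != FIELD_TERMINATOR:
            problems.append("directory is not closed by a field terminator")
        elif len(directory) % DIRECTORY_ENTRY_LENGTH != 0:
            problems.append("directory length is not a multiple of 12")

    if not record.endswith(RECORD_TERMINATOR):
        problems.append("record terminator is missing")
    return problems
```

Running structural_problems() over each record from split_records() gives a tally of what went wrong and how often, which is more useful than simply dropping a third of the records.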

This entry was posted in Tech Issues/Tips by Eric Lease Morgan.

About Eric Lease Morgan

I am a librarian first and a computer user second. My professional goal is to discover new ways to use computers to provide better library services. I use much of my time here at the University of Notre Dame developing and providing technical support for the Catholic Research Resources Alliance -- the "Catholic Portal".
