Very satisfying!

I have made significant progress harvesting EAD files and preparing them for ingestion into the “Catholic Portal”. This posting outlines the successes.

Assuming Catholic Research Resources Alliance members place their EAD files in an HTTP-accessible directory, and those files have a .xml extension, the following Perl scripts enable me to harvest the files and prepare them for indexing:

  • harvest-ead.pl – reads remote HTTP-accessible directories and copies all of the .xml files found there to a local cache
  • validate.pl – makes sure the cached XML files are well-formed and conform to the EAD DTD, and moves any files that fail to a different directory
  • transform.pl – reads the validated XML files, adds id attributes to all unitid elements through the use of a stylesheet (addunitid.xsl), transforms the resulting XML into HTML using another stylesheet (ead2html.xsl), and saves the result to an HTTP-accessible directory
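
The first two steps can be sketched roughly as follows. This is an illustration in Python, not the Perl of the scripts named above, which are not reproduced here. It assumes the remote directory is served as an ordinary HTML index page with one link per file. The names xml_links and is_well_formed are hypothetical, and the check covers well-formedness only, not conformance to the EAD DTD.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def xml_links(index_html, base_url):
    """Return absolute URLs for the .xml links in a directory listing.

    Assumption: the listing is plain HTML with one <a href> per file.
    """
    collector = _LinkCollector()
    collector.feed(index_html)
    return [urljoin(base_url, href) for href in collector.hrefs
            if href.lower().endswith(".xml")]


def is_well_formed(xml_text):
    """Return True if the text parses as XML.

    This checks well-formedness only; it does not validate against
    the EAD DTD.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

A harvester built on this would fetch each URL returned by xml_links, save it to the local cache, and set aside any file for which is_well_formed returns False. Full DTD validation would need a validating parser, which the Python standard library does not provide.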

What was really cool and a huge time-saver was the use of ead2html.xsl. Originally named AAAv2002-HTML.xsl, found on a page called User Contributed Stylesheets, and submitted by Stephanie Ashley, this stylesheet took my id attributes and automatically made named anchors for me. Boy, did I get lucky. “Thank you, Stephanie!”
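
To make the id attributes concrete, here is a minimal sketch of the idea of giving every unitid element an id. It is written in Python instead of XSLT, and it is not the contents of addunitid.xsl, which this posting does not show. The function name add_unitid_ids and the "unitid-N" numbering scheme are assumptions made for the example, as is the premise that the EAD elements carry no XML namespace.

```python
import xml.etree.ElementTree as ET


def add_unitid_ids(xml_text, prefix="unitid-"):
    """Give every unitid element that lacks an id a sequential one.

    Assumptions: elements are not namespaced, and the "unitid-N"
    scheme is hypothetical. Existing id values are left untouched.
    """
    root = ET.fromstring(xml_text)
    count = 0
    for element in root.iter("unitid"):
        count += 1
        if element.get("id") is None:
            element.set("id", f"{prefix}{count}")
    return ET.tostring(root, encoding="unicode")
```

With ids like these in place, a stylesheet can turn each one into a named anchor, so a link to the HTML page can point directly at a specific unitid.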

My next step is to revisit my indexing routines.

Author: Eric Lease Morgan

I am a librarian first and a computer user second. My professional goal is to discover new ways to use computers to provide better library services. I use much of my time here at the University of Notre Dame developing and providing technical support for the Catholic Research Resources Alliance -- the "Catholic Portal".