I have made significant progress in harvesting EAD files and preparing them for ingestion into the “Catholic Portal”. This posting outlines the successes.
Assuming Catholic Research Resources Alliance members place their EAD files in an HTTP-accessible directory, and those files have a .xml extension, the following Perl scripts enable me to harvest and prepare them for indexing:
- harvest-ead.pl – reads remote HTTP-accessible directories and copies all of the .xml files found there to a local cache
- validate.pl – makes sure the cached XML files are well-formed and conform to the EAD DTD, and moves any files that fail to a different directory
- transform.pl – reads the validated XML files, adds id attributes to all unitid elements through the use of a stylesheet (addunitid.xsl), transforms the resulting XML into HTML using another stylesheet (ead2html.xsl), and saves the result to an HTTP-accessible directory
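The first two steps above can be sketched in a few lines. This is not the code of harvest-ead.pl or validate.pl (those are Perl and are not reproduced here); it is a minimal Python illustration that assumes the remote directory is served as an ordinary HTML listing page with one link per file. The names `LinkCollector`, `xml_links`, and `is_well_formed` are hypothetical, and the check covers well-formedness only, since validating against the EAD DTD needs a validating parser that the Python standard library does not provide.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collect the href values of all anchors on a directory listing page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def xml_links(listing_html, base_url):
    """Return absolute URLs for every linked file whose name ends in .xml."""
    collector = LinkCollector()
    collector.feed(listing_html)
    return [urljoin(base_url, href) for href in collector.hrefs
            if href.lower().endswith(".xml")]


def is_well_formed(xml_bytes):
    """True if the bytes parse as XML. This does NOT check the EAD DTD."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False
```

In use, the listing page would be fetched (for example with `urllib.request.urlopen`), each URL returned by `xml_links` downloaded to the local cache, and any file for which `is_well_formed` returns false moved aside, mirroring what the scripts are described as doing.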
What was really cool and a huge time-saver was the use of ead2html.xsl. Originally named AAAv2002-HTML.xsl, found on a page called User Contributed Stylesheets, and submitted by Stephanie Ashley, this stylesheet took my id attributes and automatically made named anchors for me. Boy, did I get lucky. “Thank you, Stephanie!”
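To make the id-attribute step concrete, here is a rough Python sketch of the idea. It is only an illustration: the actual work is done by the addunitid.xsl stylesheet, whose contents are not shown in this posting, and the function name `add_unitid_ids` and the `unitid-N` naming scheme are assumptions made for the example. It also assumes the EAD elements carry no XML namespace.

```python
import xml.etree.ElementTree as ET


def add_unitid_ids(root, prefix="unitid-"):
    """Give every unitid element lacking an id one based on its position.

    Returns the number of id attributes added. Existing ids are kept.
    The prefix and numbering scheme are illustrative assumptions.
    """
    added = 0
    for position, element in enumerate(root.iter("unitid"), start=1):
        if "id" not in element.attrib:
            element.set("id", f"{prefix}{position}")
            added += 1
    return added
```

Once each unitid has an id, a stylesheet can turn that value into a named anchor in the HTML, so a URL ending in a fragment such as `#unitid-3` can point at one specific unit inside a long finding aid.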
My next step is to revisit my indexing routines.