Transforming schema-based EAD files

This posting describes my solution for transforming schema-based EAD files for the “Catholic Portal”. In a sentence, the solution boils down to removing all the namespaces from the input.

For the longest time the EAD files harvested for the Portal were validated against the EAD DTD. These files have no namespace declarations, and transformations were relatively easy. It was almost trivial for me to add unitid attributes to did-level elements. It was almost trivial for me to loop through the input files to extract did-level elements for indexing. Using a stylesheet I found through the Library of Congress, it was easy for me to convert the EAD into an HTML file for online reading.

When I started getting EAD files generated from the venerable Archivists’ Toolkit my processes broke because these new files were validated against the EAD schema, which uses two or three namespaces. None of my XPath statements worked, because in XPath 1.0 an un-prefixed element name matches only elements that are in no namespace. A number of people offered a number of suggestions. Some of them required the use of XSLT 2.0, which is not an option for me. Others thought I should update my existing stylesheets to accommodate the namespaces, but that would have been too complicated and not scalable.
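The failure is easy to reproduce outside of XSLT. The following is a minimal sketch in Python (standard library only), not part of the Portal's code. The sample record is invented for illustration; the namespace URI is the one the EAD 2002 schema declares.

```python
import xml.etree.ElementTree as ET

# Namespace declared by the EAD 2002 schema; the record below is made up.
EAD_NS = "urn:isbn:1-931666-22-9"

SAMPLE = f"""<ead xmlns="{EAD_NS}">
  <archdesc level="collection">
    <did><unittitle>Sample papers</unittitle></did>
  </archdesc>
</ead>"""

root = ET.fromstring(SAMPLE)

# An un-prefixed path asks for <did> in no namespace, so nothing matches.
plain_matches = root.findall(".//did")

# Binding a prefix to the EAD namespace makes the same path match.
prefixed_matches = root.findall(".//ead:did", {"ead": EAD_NS})

print(len(plain_matches), len(prefixed_matches))  # prints: 0 1
```

The second query is the "update the stylesheets" route: every path in every stylesheet would need the prefix added.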

In the end, I chose a different solution, one that a number of other people had alluded to: remove the namespaces. Each person offered a slightly different take on the problem, but I went for a brute force method I found in the TEI community Web space:

  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="no" />

    <!-- copy the document root, comments, and processing instructions as they are -->
    <xsl:template match="/|comment()|processing-instruction()">
      <xsl:copy>
        <xsl:apply-templates />
      </xsl:copy>
    </xsl:template>

    <!-- rebuild every element under its local name, dropping its namespace -->
    <xsl:template match="*">
      <xsl:element name="{local-name()}">
        <xsl:apply-templates select="@*|node()" />
      </xsl:element>
    </xsl:template>

    <!-- rebuild every attribute under its local name as well -->
    <xsl:template match="@*">
      <xsl:attribute name="{local-name()}">
        <xsl:value-of select="." />
      </xsl:attribute>
    </xsl:template>

  </xsl:stylesheet>
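For readers working outside of XSLT, the same idea can be sketched in Python with the standard library. This is a stand-in for illustration, not the stylesheet above and not the Portal's code: the function names and the sample record are hypothetical. Unlike the stylesheet, ElementTree's default parser discards comments and processing instructions.

```python
import xml.etree.ElementTree as ET

def local_name(qualified):
    """Return the part of an ElementTree name after any '{namespace}' prefix."""
    return qualified.rsplit("}", 1)[-1]

def strip_namespaces(xml_text):
    """Parse xml_text and rename every element and attribute to its local name."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = local_name(element.tag)
        renamed = {local_name(k): v for k, v in element.attrib.items()}
        element.attrib.clear()
        element.attrib.update(renamed)
    return root

# Invented sample: EAD 2002 namespace on elements, XLink namespace on one attribute.
SAMPLE = """<ead xmlns="urn:isbn:1-931666-22-9"
     xmlns:xlink="http://www.w3.org/1999/xlink">
  <archdesc level="collection">
    <did>
      <unitid>MSS 001</unitid>
      <dao xlink:href="http://example.org/scan.jpg"/>
    </did>
  </archdesc>
</ead>"""

stripped = strip_namespaces(SAMPLE)

# Plain, un-prefixed paths and attribute names work again.
print(len(stripped.findall(".//did")))
print(stripped.find(".//dao").get("href"))
```

One trade-off applies to both versions: because attributes are reduced to their local names, an element carrying both href and xlink:href would end up with a single href.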

Consequently, my XML processing pipeline now looks like this:

  1. harvest EAD files
  2. validate them
  3. strip namespaces
  4. add unitids
  5. transform them into HTML
  6. index them
  7. done
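To show what stripping buys the later steps, here is a rough sketch of pulling did-level elements out of a namespace-free file for indexing. It is an assumption about what such an extraction could look like, not the Portal's indexer: the function name, the record layout, and the sample file are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample of what a file might look like after the namespaces are gone.
STRIPPED = """<ead>
  <archdesc level="collection">
    <did><unitid>MSS 001</unitid><unittitle>Sample papers</unittitle></did>
    <dsc>
      <c01 level="series">
        <did><unitid>MSS 001/1</unitid><unittitle>Correspondence</unittitle></did>
      </c01>
    </dsc>
  </archdesc>
</ead>"""

def did_records(xml_text):
    """Return one small dictionary per <did>, in document order."""
    root = ET.fromstring(xml_text)
    records = []
    for did in root.iter("did"):
        records.append({
            "unitid": did.findtext("unitid", default="").strip(),
            "unittitle": did.findtext("unittitle", default="").strip(),
        })
    return records

records = did_records(STRIPPED)
for record in records:
    print(record["unitid"], "|", record["unittitle"])
```

Every path here is un-prefixed, which is how the DTD-era code was written in the first place.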

The next thing to do is improve Step #5 since the generic EAD to HTML transformation is just that: too generic.

Author: Eric Lease Morgan

I am a librarian first and a computer user second. My professional goal is to discover new ways to use computers to provide better library services. I use much of my time here at the University of Notre Dame developing and providing technical support for the Catholic Research Resources Alliance -- the "Catholic Portal".