Using OAI-PMH to populate the “Catholic Portal” is not straightforward

May 14th, 2012

Using OAI-PMH to populate the “Catholic Portal” is not straightforward, and this posting outlines some of my investigations in this regard.

Introduction

As you may or may not know, OAI-PMH is a “standard” protocol designed for harvesting metadata. It only understands six commands (or in OAI-PMH parlance, “verbs”). These commands are sent to remote computers in the form of URLs, and the remote computer is expected to respond in the form of specifically shaped XML streams. These commands include:

  • Identify – Lists who manages the repository and what type of content it contains.
  • ListMetadataFormats – Lists the various metadata schemes used to describe the repository’s content. At least one of these schemes must be Dublin Core.
  • ListSets – Specifies how the repository’s content is subdivided. There can be zero or more of these subdivisions.
  • ListIdentifiers – Returns a list of keys pointing to specific records in the repository.
  • ListRecords – An enhanced version of ListIdentifiers, this verb downloads whole records, not just identifiers.
  • GetRecord – Given a specific identifier, this verb retrieves a single record.

Through a conversation of these verbs and the returned XML streams, metadata between computers can be exchanged. It is then up to the computer doing the harvesting to implement some sort of cool and interesting service with the harvested content. Here at Catholic Portal Central we want to index the metadata and provide immediate access to remote digitized content.
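
To make the idea of verbs sent as URLs concrete, here is a minimal sketch in Python. It is not the Perl script mentioned in this posting. The base URL is hypothetical, and the function name oai_url is my own invention. The arguments metadataPrefix, set, and identifier are the ones defined by the OAI-PMH specification.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH.
VERBS = {"Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord"}

def oai_url(base_url, verb, **arguments):
    """Return the URL for one OAI-PMH request against base_url."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb}
    query.update(arguments)
    return base_url + "?" + urlencode(query)

# Hypothetical repository address, used only for illustration.
BASE = "http://repository.example.edu/oai"

print(oai_url(BASE, "Identify"))
print(oai_url(BASE, "ListRecords", metadataPrefix="oai_dc", set="coll6"))
print(oai_url(BASE, "GetRecord", metadataPrefix="oai_dc",
              identifier="oai:content.library.luc.edu:coll6/45"))
```

Fetching each URL and parsing the returned XML is the other half of the conversation, and it is left out of this sketch.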

Investigations

At least three Catholic Research Resources Alliance (CRRA) members have OAI-PMH repositories: Duquesne University, Boston College, and Loyola University Chicago. Using a little Perl script, I most recently investigated the content of the repositories of Boston College and Loyola University Chicago. Through this process I learned what metadata formats they supported and what sets were used to subdivide their collections, and I output Dublin Core metadata from a few selected sets.

The harvested Dublin Core metadata was typical of OAI-PMH repositories: thin, a bit ambiguous, and somewhat inconsistent across repositories. It is thin because many of the Dublin Core elements are left unpopulated. It is ambiguous because many of the fields are repeated, and the values of repeated elements are of different types. For example, a description field may be empty, or it may contain an abstract of the work, the full text of the work, or the process used to digitize the material. It is inconsistent because things like dates, names, and subject entries are formatted differently. In some places names are listed in first name/last name order. Other times it is last name/first name order. Dates can be anything from “February 12, 2012” to “2012-02-12” to “Twelfth Century”. None of this is new to the world of OAI-PMH. It is typical.

All is not lost. There are patterns to this apparent randomness. Using my script I can sometimes output titles, descriptions, subject headings, and URLs of digitized objects. For example, here is such a list from the Loyola University Chicago repository:

item: 46

key: oai:content.library.luc.edu:coll6/45

title(s): Letter to the Secretary of the Literary Agency of London, 1908
title(s): Catholic Women Poets

identifier(s): cudahy219e3

identifier(s): 003_kayesmith_1908;pg3.jpg

identifier(s): http://content.library.luc.edu/u?/coll6,45

subject(s): Shelia Kaye-Smith; poets; women poets; Catholic poets

subject(s): Local

description(s): third page of letter requesting appointment

description(s): does not suit you any other time up to 4 15 will do Would you kindly send a reply to me c o Miss F E Walters Girton College Cambridge With apologies for troubling you believe me Yours faithfully Sheila Kaye Smith

description(s): Master file scanned at 600 dpi RGB in reflective mode from original document using MicroTek ScanMaker 1000XL

description(s): http://www.luc.edu.archives

type: image

From this output it becomes apparent that the first title is the title of the artifact, the third identifier is the URL of the digitized object, the first subject field is a delimited list of keywords, the first description is a sort of abstract, and the type field contains a value denoting what kind of digitized thing is in question. Thus, the output follows a pattern, and computers are very good at patterns. A computer program could therefore easily be written to read this particular OAI-PMH output and store it in the Portal’s index.
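
The pattern can be sketched in a few lines of Python. This is an illustration, not the script I used. The output field names (id, title, url, and so on) are hypothetical and are not VuFind’s Solr fields. The sketch also picks the first identifier that looks like a URL instead of counting to the third one, which is an assumption of mine that may travel a little better between sets.

```python
def map_record(record):
    """Map one harvested Dublin Core record to a flat index entry."""
    subjects = record.get("subject", [])
    descriptions = record.get("description", [])
    urls = [i for i in record.get("identifier", []) if i.startswith("http")]
    return {
        "id": record["key"],
        # The first title is the title of the artifact.
        "title": record["title"][0],
        # The identifier that is a URL points at the digitized object.
        "url": urls[0] if urls else None,
        # The first subject is a semicolon-delimited list of keywords.
        "keywords": [s.strip() for s in subjects[0].split(";")] if subjects else [],
        # The first description is a sort of abstract.
        "abstract": descriptions[0] if descriptions else None,
        "type": record.get("type", [None])[0],
    }

# The Loyola University Chicago record listed above, as a dictionary.
record = {
    "key": "oai:content.library.luc.edu:coll6/45",
    "title": ["Letter to the Secretary of the Literary Agency of London, 1908",
              "Catholic Women Poets"],
    "identifier": ["cudahy219e3", "003_kayesmith_1908;pg3.jpg",
                   "http://content.library.luc.edu/u?/coll6,45"],
    "subject": ["Shelia Kaye-Smith; poets; women poets; Catholic poets", "Local"],
    "description": ["third page of letter requesting appointment"],
    "type": ["image"],
}

for field, value in map_record(record).items():
    print(f"{field}: {value}")
```

A mapping like this holds only for the set it was written against, which is exactly why the approach is fragile.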

Next steps

My next steps are two-fold. First, I will harvest and index some of the metadata from selected Loyola University Chicago OAI-PMH sets. Second, I will let colleagues from various CRRA committees (specifically the Digital Access Committee as well as the Collection Committee) peruse the results. In the end I hope to get feedback on how to proceed. Should I index more content? Less? None? If more, then how should records be displayed, and exactly how ought the Dublin Core metadata be mapped to VuFind’s underlying Solr index fields?

All of this work is entirely feasible. At the same time it is not enormously scalable. Hand-crafting the parsing of OAI-PMH output and hand-crafting how it all gets mapped to Solr’s index are time-consuming and fragile. The Portal Home Planet can easily do this work for no more than a dozen different repositories, but after that some other means of production will need to be examined.

April 2012 Update

April 23rd, 2012

CRRA UPDATE

April 2012

This month’s update includes:

  • A Focus on Members, from Janice Welburn, Chair, CRRA Board of Directors
    To guide us in developing effective strategies for successful member engagement, the Board has set up a Membership committee and I’m delighted to welcome a current Board member, Evelyn Minick, University Librarian, Saint Joseph’s University, as the chair. The Committee’s major objectives are to grow the membership and ensure retention of current members …
  • CRRA Collections Spotlight: The Philadelphia Archdiocesan Historical Research Center Catholic Newspaper Collection, by Shawn Weldon
    The Philadelphia Archdiocesan Historical Research Center (PAHRC) holds one of the largest collections of Catholic newspapers in the United States …
  • Update on the Digital Access Committee (DAC), from Demian Katz, DAC Chair
    In spite of changes, DAC has pressed forward with several initiatives.  The Catholic Portal, still the centerpiece of CRRA’s website, is under continuous improvement, both in response to member feedback gathered during usability testing and due to new features in the underlying VuFind software …
  • Mark Your Calendars: All-Members Meeting, Anaheim, CA, June 25-26, 2012, all are invited;
    Archival Networks and EAD Consortia at SAA in August (San Diego); Fall Symposium at DePaul University, Oct. 15-16, 2012
  • Position Announcement: Duquesne University

A Focus on Members
from Janice Welburn, Chair, CRRA Board of Directors

 The new strategic plan affirms the importance of a strong value proposition for members.  To guide us in developing effective strategies for successful member engagement, the Board has set up a Membership committee and I’m delighted to welcome a current Board member, Evelyn Minick, University Librarian, Saint Joseph’s University, as the chair.  Evelyn’s deep commitment to our mission, keen insights into member expectations and effective leadership of the task force that developed a multi-tiered dues schedule, make her an excellent choice to guide our membership development and support. While we may add other members over time, I am pleased to announce the initial membership:

  • Kris Brancolini, Dean of the Library, Loyola Marymount University, Los Angeles
  • Theresa Byrd, University Librarian, University of San Diego; also a Board member
  • Melody McMahon, Director of the Paul Bechtold Library, Catholic Theological Union, Chicago
  • Tom Messner, Library Director, Barry University, Miami Shores, FL
  • Laverna Saunders, Library Director, Duquesne University, Pittsburgh
  • Bob Seal, Dean of Libraries, Loyola University Chicago
  • Kathy Webb, Dean of University Libraries, University of Dayton
  • Jennifer Younger, ex officio, Executive Director, CRRA

The Committee’s major objectives are to grow the membership and ensure retention of current members.  It is advisory to the Board.  Although the Committee plays a central role, it is important to emphasize that the Committee will consult broadly with members on needs and expectations of membership, as well as actively seek suggestions from individuals and committees on prospective members.  We want to continue our participative tradition of reaching out to potential members as noted in our protocol for inviting new members.  The charge to the Membership Committee will be accessible shortly along with the full roster on our website.

 


 THE PHILADELPHIA ARCHDIOCESAN HISTORICAL RESEARCH CENTER

CATHOLIC NEWSPAPER COLLECTION

 The Philadelphia Archdiocesan Historical Research Center (PAHRC) holds one of the largest collections of Catholic newspapers in the United States. These newspapers were collected by the American Catholic Historical Society of Philadelphia which was founded in 1886 to collect material documenting the history of Catholicism in the United States. The ACHS collections, including manuscripts, newspapers, periodicals, pamphlets, books, artifacts and graphic material, were given to the Archdiocese of Philadelphia in the 1930’s. In 1989, the ACHS Collection was merged with the Archives of the Archdiocese of Philadelphia to form PAHRC.

The newspaper collection contains Catholic newspapers from throughout the United States as well as some Catholic newspapers from Canada, England, Ireland, France and Italy.  The collection contains over 300 titles, representing 35 states and the District of Columbia, and covers the period primarily from the 1820’s through the 1940’s. The bulk of the collection dates from the 1840’s through the 1920’s.

Included are early and prominent Catholic newspapers such as The Catholic Press/The United States Catholic Press (Hartford), The Catholic Miscellany (Charleston), The Catholic Herald (Philadelphia), The Catholic Mirror (Baltimore), The Catholic Advocate (Louisville), The Pilot (Boston), The Catholic Telegraph (Cincinnati) and The Freeman’s Journal (New York City). The collection also contains many ethnic newspapers, including Irish-American, German-American and Polish-American newspapers, as well as newspapers published for a juvenile audience, society newspapers and papers published for the support of Catholic institutions.

Notable are some of the first black Catholic newspapers published in the United States. There is a good run of the American Catholic Tribune, originally published in Cincinnati and later in Detroit, for the years 1887-1894. There are some issues of The Journal, a Philadelphia black Catholic newspaper that was published for a few months in 1892. The collection also includes Volume I, Number 1 (February 18, 1905) of The Catholic Herald, a black Catholic newspaper in Washington, D.C. which may be the only issue published. For more information on black Catholic newspapers and periodicals in the PAHRC collection see the following: http://www.pahrc.net/index.php/black-catholic-periodicals/

The collection also contains other rare titles such as Sina Sapa Wocekiye Taeyanpaha, a North Dakota newspaper published in the Sioux language, The Catholic Visitor (Richmond, Virginia), The New Jersey Catholic Journal (Trenton, New Jersey) and perhaps the only significant collection of Redpath’s Illustrated Weekly, a primarily Irish national newspaper published in New York City by the journalist and social activist James Redpath.

Despite the size and research value of the collection, there are some issues that impact its usefulness. Although there are very large runs of issues for the major newspapers, none of the titles is complete and there are gaps and issues missing. Most of the rarer newspapers may contain only a few issues or a few years of the paper. The most pressing issue is that the collection is maintained primarily in hard copy and a significant number of the newspapers are in very fragile condition and in need of immediate conservation and preservation. One of the advantages of membership in the CRRA is the opportunity to cooperate with other repositories facing these same issues to create a comprehensive online inventory and directory of North American Catholic newspapers and to facilitate the eventual digitization of the various collections. One of my goals as a member of the CRRA Newspapers Taskforce is to assist in the realization of these projects.

To view the contents of the newspaper collection at PAHRC see the following:  http://pahrc.pastperfect-online.com/30664cgi/mweb.exe?request=clicksearch;dtype=d;subset=0;_t1101=newspapers

–Shawn Weldon, PAHRC

Member, Collections Committee

Member, Catholic Newspapers Task Force

 


Update on the Digital Access Committee (DAC), from Demian Katz, DAC Chair

The Digital Access Committee has had some recent membership changes, bidding farewell to Ann Hanlon (Marquette) and welcoming new member Megan Bernal (DePaul).

In spite of changes, DAC has pressed forward with several initiatives.  The Catholic Portal, still the centerpiece of CRRA’s website, is under continuous improvement, both in response to member feedback gathered during usability testing and due to new features in the underlying VuFind software used to run it.  Additionally, DAC has begun looking at some new software that can be used to expand and improve CRRA’s online presence.  The Concrete5 Content Management System is an open source tool for building websites, and DAC hopes to use it to improve the quality and simplify the maintenance of the informational pages that accompany the Catholic Portal on catholicresearch.net.

Archon and Archivists’ Toolkit are both packages for building EAD files for archival description, and DAC has been weighing the benefits of installing one of these packages to help members build finding aids.  Finally, DAC is also preparing to support the Newspapers Task Force as needed as efforts progress.

 


 

Mark Your Calendars!  Upcoming Events

 All-Members Meeting

Anaheim, CA

June 25-26, 2012


CRRA colleagues,
As you make plans for the ALA conference and/or the following ATLA conference, we hope you will also make time for the CRRA All-Members meeting.  This announcement also appeared in the CRRA March Update but we thought it might be good to send it out again after the Easter holiday. –Jennifer

You are invited to the annual All-Members meeting.  While we don’t know specific locations at this time, we will hold our events in easily accessible locations. The Anaheim Resort Transit Trolley has numerous routes connecting hotels, restaurants, shops, convention center and the Crystal Cathedral, and we will provide directions for getting to CRRA events.  Later we will ask for RSVP’s from those attending Monday’s dinner and/or Tuesday’s meeting and/or lunch so as to provide appropriately for a dinner reservation, and for breaks and lunch on Tuesday. On Monday evening, June 25, we will meet for dinner at a casual restaurant. We meet about 6:30.  We will make a group reservation.

We meet on Tuesday, June 26, from 9:00 a.m. through 12:30 p.m. followed by lunch (optional).  Our agenda is focused on mission-support for the next year: identifying top priorities, ideas for forming local teams and expanding our understanding of Catholic Studies.  With the announcement that the Board has adopted a five year strategic plan, we will be asking committees to develop their annual goals in this context and will be inviting all members to participate in identifying high priorities for the coming year.


Agenda

  • Welcome, Janice Welburn, chair, Board of Directors
  • Annual goals, objectives and priorities – Moderator, Pat Lawton
  • Forming institutional teams – Panel discussion TBA
  • Catholic Studies and challenges facing Catholic educators – Rev. James Heft, S.M., President, Institute for Advanced Catholic Studies at the University of Southern California and Member, CRRA Leadership Council

We look forward to meeting with as many of you as can be there. Please share this invitation with any others at your institution who may also be in Anaheim.  Traditionally, our meetings are open to others interested in our mission and activities. If you know of others who might like to attend, you can share this information or request that Pat or Jennifer do so.  See you there.

Jennifer Younger
CRRA Executive Director


Attending SAA in San Diego in August?  This session on Networks and EAD Consortia may be of interest:

Archival Networks and EAD Consortia

EAD consortia and aggregators of archival resources share broad interests in the ongoing exchange of information about each others’ projects and programs.  Why reinvent the wheel?

Where: SAA 76th Annual Meeting, San Diego Hilton Bayfront — room to be determined.  Please consult conference program for location details, once available.

When: Thursday, August 9, 2012, 12:00-1:15 pm

Goal: to increase communication across consortia, in order to share expertise and develop a common vision for broader archival description and discovery networks.

Agenda: brief regional/statewide/national program updates, followed by structured discussion.  Additional agenda details forthcoming.

Anyone interested is welcome to attend.

Jodi Allison-Bunnell, Orbis Cascade and NWDA
Jennifer Schaffner, OCLC Research
Adrian Turner, Online Archive of California and the California Digital Library

Fall Symposium at DePaul University, Oct. 15-16, 2012
Continuing on the success of the November 2011 Duquesne Symposium, plans are underway for a Fall Symposium to be held at DePaul University Oct. 15-16.  Please hold this date, and watch the CRRA Update for further details.


Position Available: Reference & Instruction Librarian, Duquesne University

 Duquesne University

Gumberg Library

Reference & Instruction Librarian
NATURE OF WORK:
This non-tenured library faculty position reports to the Director of Information Services.  This is primarily a public service position with significant instructional and liaison duties.  Knowledge of information sources, interpersonal skills, instructional skills, and technology skills are of highest importance for this position.  Provides reference service and instruction to enable members of the Duquesne University community and guest users to find and effectively make use of library resources and other information sources.

For the full posting, please see: http://www.duq.edu/hr/faculty/faculty-jobs-openings/gumberg.cfm

 


CRRA Update is an electronic newsletter distributed via email to provide members with an update of CRRA activities.  Please contact Pat Lawton at 574.631.1324 or email plawton@nd.edu with your questions, comments, or news to share. We welcome your news items!

———
CRRA Calendar: http://tiny.cc/Calendar798
CRRA Contact page: http://www.catholicresearch.net/About/Contact
CRRA blog: http://www.catholicresearch.net/blog/

Statistical reports against the “Catholic Portal”

April 17th, 2012

This text describes the beginnings of a set of statistical reports describing the use of the “Catholic Portal”.

More specifically, the Portal’s Web server log files are read on a daily basis, normalized, and saved to an underlying database. A number of queries are then applied to the database to create rudimentary lists of tabulations. Each of the reports is described below:

  • Hosts – This report lists the Internet address or name of the top 100 computers using the Portal. To the best of our ability, the list excludes Internet robots and spiders, but the list needs to be updated. As of this writing, it is quite likely that many of the top computers are still robots, and the host named university.archives.nd.edu is probably the most frequent user of the Portal with shunat236-189.shu.edu coming in at a close second.
  • Page count – This is a list of the number of hits the Portal received on any given day. Obviously the script creating this report needs to be updated in order to output data for the current year.
  • Query strings – This is a tabulation of the most frequently used search terms applied against the Portal. The “null” query is probably a simple hit against the “browse” link at the bottom of the Portal’s home page and/or simply clicking the search box’s Find button. The queries in quotes are probably from clicks on hot linked search results.
  • Referrers – This is a list of the websites where people came from before they visited the Portal. A whole lot of these websites are places where blog postings about the Portal appear. Many are spam. Some are HTML versions of the EAD finding aids. Further down the list one can begin to see Google searches.
  • Referrers engines – This report is exactly like the Referrers report except it only includes search engines (Google, Yahoo, and Bing).
  • Tabs – This is a list of the most frequently used links across the top of the Portal’s home page.
  • Top records – This is a tabulation of the most frequently viewed records in the Portal. The first item on the list is an error, but as of this writing the most frequently viewed record is something from Catholic University of America.
  • Types of searches – From this report it is all but obvious that the overwhelming majority of the searches applied against the Portal are free text searches. Nobody uses the advanced search form.
  • Whose records – This is a list of the names of the libraries/institutions whose records are viewed most frequently.

For a more technical description of how these reports are generated, see the blog posting entitled “Data warehousing Web server log files” as well as a follow-up posting called “Progress with statistics reporting”.
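
For readers who want a feel for the read, normalize, save, and query cycle without reading those postings, here is a minimal sketch in Python using SQLite. It is not the code behind the reports. It assumes the Web server writes Apache’s “combined” log format, the table and function names are hypothetical, and the sample log lines are made up.

```python
import re
import sqlite3

# Apache "combined" log format is an assumption about the Portal's server.
LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<date>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def load(lines, db):
    """Normalize log lines and save them to a 'hits' table."""
    db.execute("CREATE TABLE IF NOT EXISTS hits "
               "(host TEXT, date TEXT, path TEXT, referrer TEXT, agent TEXT)")
    for line in lines:
        m = LINE.match(line)
        if m:  # silently skip lines that do not fit the pattern
            db.execute("INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
                       (m["host"], m["date"], m["path"], m["referrer"], m["agent"]))

def top_hosts(db, limit=100):
    """Tabulate the hosts with the most hits, busiest first."""
    return db.execute(
        "SELECT host, COUNT(*) AS n FROM hits "
        "GROUP BY host ORDER BY n DESC, host LIMIT ?", (limit,)).fetchall()

# Made-up sample lines; only the host names come from the Hosts report.
SAMPLE = [
    'university.archives.nd.edu - - [17/Apr/2012:10:00:00 -0400] "GET /Search/Results?lookfor=poets HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    'shunat236-189.shu.edu - - [17/Apr/2012:10:01:00 -0400] "GET /Record/1 HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    'university.archives.nd.edu - - [17/Apr/2012:10:02:00 -0400] "GET /Record/2 HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

db = sqlite3.connect(":memory:")
load(SAMPLE, db)
for host, count in top_hosts(db):
    print(count, host)
```

The other reports are variations on the same theme: a different column in the GROUP BY clause, or a WHERE clause to exclude robots.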

These reports can be improved in any number of ways. First, they could be represented graphically (pie charts, histograms, etc.). Second, they could be re-generated on a month-by-month basis to look for trends over time. Luckily just about all the necessary data has been preserved. Alternatively, a peek at the Portal’s Google Analytics site may illuminate additional trends.

Transforming schema-based EAD files

April 10th, 2012

This posting describes my solution for transforming schema-based EAD files for the “Catholic Portal”. In a sentence, the solution boils down to removing all the namespaces from the input.

For the longest time the EAD files harvested for the Portal were validated against the EAD DTD. These files have no namespace declarations, and transformations were relatively easy. It was almost trivial for me to add unitid attributes to did-level elements. It was almost trivial for me to loop through the input files to extract did-level elements for indexing. Using a stylesheet I found through the Library of Congress, it was easy for me to convert the EAD into an HTML file for online reading.

When I started getting EAD files generated from the venerable Archivists’ Toolkit, my processes broke because these new files were validated against the EAD schema, which brings two or three namespaces with it. None of my XPath statements worked. A number of people offered a number of suggestions. Some of them required the use of XSLT 2.0, which is not an option for me. Others thought I should update my existing stylesheets to accommodate the namespaces, but that would have been too complicated and not scalable.

In the end, I chose a different solution which was alluded to by a number of other people — remove the namespaces. Each person offered a slightly different take on the problem, but in the end I went for a brute force method I found in the TEI community Web space:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="no" />
  <xsl:template match="/|comment()|processing-instruction()">
    <xsl:copy>
      <xsl:apply-templates />
    </xsl:copy>
  </xsl:template>
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()" />
    </xsl:element>
  </xsl:template>
  <xsl:template match="@*">
    <xsl:attribute name="{local-name()}">
      <xsl:value-of select="." />
    </xsl:attribute>
  </xsl:template>
</xsl:stylesheet>

Consequently, my XML processing pipeline now looks like this:

  1. harvest EAD files
  2. validate them
  3. strip namespaces
  4. add unitids
  5. transform them into HTML
  6. index them
  7. done

The next thing to do is improve Step #5 since the generic EAD to HTML transformation is just that — too generic.
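
For those who do not work in XSLT, the same brute force idea behind Step #3 can be sketched in Python with the standard library. This is an illustration, not part of my pipeline. The function names are hypothetical, the sample document is made up, and, unlike the stylesheet, ElementTree’s default parser discards comments and processing instructions.

```python
import xml.etree.ElementTree as ET

def local_name(name):
    """Drop a '{namespace-uri}' prefix from an ElementTree name."""
    return name.split("}", 1)[1] if name.startswith("{") else name

def strip_namespaces(xml_text):
    """Return xml_text with every element and attribute namespace removed."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = local_name(element.tag)
        attributes = {local_name(k): v for k, v in element.attrib.items()}
        element.attrib.clear()
        element.attrib.update(attributes)
    return ET.tostring(root, encoding="unicode")

# A made-up, schema-style EAD fragment with two namespaces.
sample = (
    '<ead xmlns="urn:isbn:1-931666-22-9" '
    'xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<archdesc level="collection"><did><unittitle>Letters</unittitle>'
    '<dao xlink:href="http://example.org/1.jpg"/></did></archdesc></ead>'
)

clean = strip_namespaces(sample)
print(clean)

# Plain, namespace-free paths work again.
print(ET.fromstring(clean).find("archdesc/did/unittitle").text)
```

As with the stylesheet, two attributes that differ only by namespace would collide after stripping. I have not seen that happen in EAD files, but it is the price of brute force.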

Moving to VuFind version 1.3

March 23rd, 2012

We here at “Catholic Portal Central” are spending time and effort moving to VuFind version 1.3. To this end I have implemented a number of things as per our usability studies as well as begun to skin the underlying “blueprint” theme. Give it a whirl and share your thoughts — http://vufind.library.nd.edu

Linking CRRA items to member libraries: A Prototype

March 19th, 2012

I have implemented a prototype for linking items found in the “Catholic Portal” to CRRA member institutions.

The Problem

The vast majority of the content in the “Portal” is not digitized. Consequently, when items of interest are identified, the reader is left hanging because the Portal does not support document delivery. “Now that I have found this item, how do I get it?”

The Solution

The solution is not perfect, but rather a step in the right direction. Instead of delivering the item, the solution is to provide a means for the reader (I don’t use the word “user” anymore) to easily connect with the member institution libraries through a directory. Specifically, create a directory of member institution libraries/archives complete with names, addresses, and other pieces of contact information. Hyperlink each and every search result to specific entries in the directory and thus enable readers to get in touch with member institutions.

I have implemented this in the Portal’s “sandbox”. Search for any item, and from both the search result page as well as detail holdings pages, the reader can click on the institutions’ library and be shown a (bogus) directory.

The implementation was much easier than I anticipated, and the key was found in the identifiers of each indexed record. (All puns intended.) The identifier of each indexed record in the Portal is prefixed with a code denoting the library holding the item. For example, Boston College’s code is bcu, and Loyola Marymount University’s code is lmu. When search items are returned, VuFind’s IndexRecord record driver is called. In that code I am able to extract each record’s identifier and parse out its first three characters, the code. I then pass this code and the library’s name on to my template for display:

// Pass the holding library's name and its three-character code to the template.
$interface->assign('CRRALibrary', $this->fields['building'][0]);
$interface->assign('CRRAKey', substr($this->fields['id'], 0, 3));

In the template I hyperlink the holding library’s name with the directory’s URL, and specifically, a named anchor for the library:

<a href='http://zoia.library.nd.edu/tmp/directory.html#{$CRRAKey}'>{$CRRALibrary}</a>

The directory I created was rudimentary at best, and it will be up to others, as well as me, to determine how the directory gets created and what it looks like.
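
The whole trick of turning a record identifier into a directory link fits in one function. Here it is restated in Python, purely as an illustration of the logic, since the real work is done in VuFind’s PHP and template code. The function name and the sample identifier are hypothetical.

```python
from html import escape

# The prototype directory used in the sandbox.
DIRECTORY = "http://zoia.library.nd.edu/tmp/directory.html"

def directory_link(record_id, library_name):
    """Link a library's name to its named anchor in the directory."""
    key = record_id[:3]  # the first three characters are the library's code
    return f'<a href="{DIRECTORY}#{escape(key)}">{escape(library_name)}</a>'

# "bcu_000123" is a made-up identifier with Boston College's prefix.
print(directory_link("bcu_000123", "Boston College"))
```

The approach works only as long as every library’s code is exactly three characters long.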

Hooray for open source software and object oriented programming techniques!

Prioritized list of fixes/enhancements for the “Portal”

March 7th, 2012

Based on our usability studies and conference call from the other day, I have created a (more or less) prioritized list of fixes/enhancements to be applied to the “Portal”:

  • add a note to the email dialog box stating that the From field is mandatory and requires an email address
  • create a directory of institutions, and from search results hyperlink institutions’ names to the directory
  • update the “Portal” look & feel (theme) so it is based on the “blueprint” theme
  • turn off the “Suggested Topics” feature
  • fix the author searches so when author names are clicked the content displays correctly
  • make the login links float to the right instead of the left
  • change the red text — such as the text in the search box — to black
  • change the login label to read “Login / Create account”

On my mark. Get set. Go.

How to make MARC and EAD metadata available in the “Catholic Portal”

February 22nd, 2012

This is a set of (draft) prescriptive instructions describing how to make MARC and EAD metadata available in the “Catholic Portal”.

Introduction

At its core, the “Portal” is an index — a list of pointers to content items. Access to this index is implemented through a form-based interface. Readers enter queries into the form, and items are returned. Readers are then expected to select items of interest from the returned list, and use them for the purposes of research and scholarship. In order to implement this functionality, each content item in the index requires, at the very least, three elements: 1) a unique identifier, 2) a human-readable description of the item, and 3) a location code where the item can be acquired.

The MARC and EAD metadata schemes are well-suited for indexing. After making sets of MARC records and/or EAD files transparently accessible on a Web server, it is easy to harvest the metadata, integrate it into the Portal’s index, and provide access to the content items.

The balance of this posting describes how to make MARC and EAD files available for harvesting.

MARC

Here’s the short version. Export all the MARC records from your integrated library system you think are apropos to the “Catholic Portal” making sure they are encoded using the UTF-8 character set. Save the resulting file on a Web server, and tell Eric Morgan the URL of the resulting file. Eric will do the rest.

Here’s the long version. Remember, every record in the Portal needs a unique identifier, a human-readable description, and a location code. For MARC records, this means every record first needs a value in the 001 field. Any value will do as long as it is unique to your set of records. Second, each MARC record needs something in the 245 field. At the very least this will be the human-readable description. All the other descriptive and analytic fields will supplement this description. Third, each MARC record needs to have a location code, and this is the item’s call number. This value will most likely be extracted from the 090 field.
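
If you want to check an export against these three requirements before sending it along, the logic is simple. The following Python sketch is not a Portal tool. It assumes each record has already been read into a dictionary mapping field tags to lists of values (a hypothetical representation; any MARC library can produce something like it), and it treats 090 as required, following the text above.

```python
# Field tags taken from the requirements above.
REQUIRED = {"001": "unique identifier", "245": "title", "090": "call number"}

def check_records(records):
    """Return a list of human-readable problems found in the records."""
    problems = []
    seen = set()
    for position, record in enumerate(records, start=1):
        for tag, label in REQUIRED.items():
            if not any(value.strip() for value in record.get(tag, [])):
                problems.append(f"record {position}: missing {tag} ({label})")
        for identifier in record.get("001", []):
            if identifier in seen:
                problems.append(f"record {position}: duplicate 001 {identifier}")
            seen.add(identifier)
    return problems

# Two made-up records; the second reuses an identifier and lacks a call number.
records = [
    {"001": ["a1"], "245": ["Title one"], "090": ["BX1 .A1"]},
    {"001": ["a1"], "245": ["Title two"]},
]

for problem in check_records(records):
    print(problem)
```

If your call numbers live somewhere other than field 090, change the REQUIRED table accordingly.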

Helping you decide which MARC records to extract from your integrated library system is beyond the scope of this document. But once you have figured that out it is recommended you denote which items are to be extracted by updating them with a local note. Here at the University of Notre Dame, we put the letters CRRA in field 590 subfield a. Once this is done it is relatively easy for the systems librarian to do a search for CRRA in field 590 subfield a, and dump the resulting records to a file. Alternatively, the systems librarian might search for all items whose call numbers begin with BX and dump the resulting set. The process you use to denote and export your MARC records depends on your local environment.

When exporting your MARC records from your integrated library system, it is imperative the records be encoded using the UTF-8 character set and not something else. The Portal’s underlying indexer does not deal very well with encodings of another kind. If your system does not export records as UTF-8, and it exports things in MARC-8 instead, then use an open source application called yaz-marcdump from Index Data to transform your records from one encoding into another. Once yaz-marcdump is installed you can execute a command like the following to do the transformation:

yaz-marcdump -f MARC-8 -t UTF-8 -o marc -l 9=97 input.mrc > output.mrc

The command translates MARC records from (-f) MARC-8 encoding to (-t) UTF-8 encoding. It outputs (-o) the result as MARC records, and inserts the letter a (ASCII character 97) into the leader (-l) at position 9. It uses the file named input.mrc as input, and it outputs the result to a file named output.mrc.
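After running the conversion, you may want to confirm that the result really is flagged and encoded as UTF-8. The following is a rough sketch under stated assumptions, not a definitive test: the function name looks_like_utf8_marc is hypothetical, and it simply checks that leader position 9 holds the letter a and that the record's bytes decode as UTF-8. A MARC-8 record containing nothing but plain ASCII would decode too, so treat a passing result as a hint rather than proof.

```python
def looks_like_utf8_marc(record: bytes) -> bool:
    """True when leader position 9 is 'a' and the raw record decodes as UTF-8."""
    # Leader position 9 is the character coding scheme; 'a' marks Unicode.
    if record[9:10] != b"a":
        return False
    try:
        record.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes that are not valid UTF-8, e.g. left-over MARC-8 diacritics.
        return False
    return True
```

Pass it one raw record at a time, read from output.mrc in binary mode.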

Every time you export your records, you should export everything that you feel is relevant to the Portal. Do not worry about additions, changes, or deletions. We here at Portal Central handle this issue by deleting all of your records locally and re-indexing the whole lot.

After the records have been exported, save them on a Web server, and finally, tell Eric Morgan the URL of the resulting file. Please don’t change the URL. Eric will harvest the records and incorporate them into the index. As of this writing it is a good idea to tell Eric when new records are available, but at some point in time this won’t be necessary.

EAD

Here’s the short version. Use validated EAD files to encode the content you deem apropos to the Portal. Save all the EAD files in a single directory on a Web server making sure each file is given a .xml extension. Tell Eric Morgan the URL of the directory, and he will take care of the rest.

Here’s the longer version. Use whatever tool you desire to create EAD files describing the archival content you deem appropriate for the Portal. There are any number of available editors and applications facilitating this process. Make sure the resulting EAD files validate against the EAD DTD or schema. It doesn’t really matter which one, but right now validation against the DTD is easier to handle here at Portal Central.

Each did-level element in your EAD files will eventually become a record in the Portal’s index. During pre-processing here at Portal Central, unique unitid attributes will be added to each did-level element, if no unitid attributes exist in the first place. This pre-processing satisfies the need for unique identifiers. You need to do nothing with regard to unique identifiers.

Each did-level unittitle element will recursively be combined with its parent did/unittitle element to form a human-readable description of each content item. Consequently, there is nothing you need to do with regard to human-readable descriptions.
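To get a feel for what such combined descriptions might look like for your own finding aids, here is a small sketch. It is not the code used at Portal Central; the function name describe and the " / " separator are assumptions made for illustration. It also assumes DTD-style EAD files whose elements carry no XML namespace.

```python
import xml.etree.ElementTree as ET
from typing import List


def describe(ead_xml: str, separator: str = " / ") -> List[str]:
    """Build one description per did, prefixed with the titles of its ancestors."""
    root = ET.fromstring(ead_xml)
    descriptions: List[str] = []

    def walk(element, inherited):
        titles = inherited
        did = element.find("did")  # direct child only
        if did is not None:
            title = did.find("unittitle")
            if title is not None:
                # Flatten nested markup (e.g. unitdate) and tidy the whitespace.
                text = " ".join("".join(title.itertext()).split())
                if text:
                    titles = inherited + [text]
                    descriptions.append(separator.join(titles))
        for child in element:
            if child.tag != "did":
                walk(child, titles)

    walk(root, [])
    return descriptions
```

Given a collection titled "Smith Papers" with a series "Correspondence", the series would be described as "Smith Papers / Correspondence" under these assumptions.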

The location of items found in EAD files is communicated in three ways. First, the name of your hosting institution and library/archive will be associated with each search result, so the need for location information is satisfied, but only in a rudimentary way. Second, location information is reinforced through the url attribute of the eadid element. Specifically, you are expected to include a value in this attribute that points to a human-readable version of your EAD file on your Web server. Portal search results include hot links with a label similar to “View finding aid at owning institution”, and each such link is the value of the url attribute. Your human-readable version of the EAD file is then expected to include instructions and contact information describing how to acquire items of interest. Finally, search results will include a second hot link with a label similar to “View finding aid in Portal display”. This link points to a local HTML file transformed from the original EAD. Again, location and contact information should be part of the HTML because it was part of the original EAD.

In summary, create complete and valid EAD files making sure you include values in the url attributes of the eadid elements.
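Before handing over your files, you could check each one for the url attribute yourself. The sketch below is only an illustration, with a hypothetical function name (eadid_url), and it assumes DTD-style EAD without an XML namespace. It does not validate the file against the DTD or schema; it only reports whether eadid carries a non-empty url value.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def eadid_url(ead_xml: str) -> Optional[str]:
    """Return the url attribute of the first eadid element, or None if absent."""
    root = ET.fromstring(ead_xml)
    eadid = root.find(".//eadid")
    if eadid is None:
        return None
    url = (eadid.get("url") or "").strip()
    return url or None  # treat an empty attribute the same as a missing one
```

A file for which this returns None is a file without a usable link back to your human-readable finding aid.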

Once you have created your EAD files, save them in a single directory on a Web server, and tell Eric Morgan the URL of the directory. Make sure each EAD file ends with a .xml extension. Eric will then regularly harvest all the .xml files from your directory, re-validate them, make sure they include url attributes, add unique identifiers to each did-level element, and index each did-level element.

Philadelphia Archdiocesan Historical Research Center (PAHRC) records

February 7th, 2012

Just under 1,100 records from the Philadelphia Archdiocesan Historical Research Center (PAHRC) have been added to the “Portal”: http://bit.ly/uG92RG

Content from the University of Dayton

January 16th, 2012

Twenty-nine records from the Archives at the University of Dayton have been added to the “Catholic Portal”: http://bit.ly/weVl8h