Harvesting metadata

It is imperative for CRRA member institutions to make their metadata available for harvesting via a Web server.

A couple of years ago, when the “Portal” was just beginning, the modus operandi for ingesting MARC and EAD metadata was to send it to Notre Dame, save it on a local hard disk, and index it. That process worked then, but as we grow it becomes less and less scalable.

Nowadays the preferred method of getting your metadata to the Portal is through harvesting. Here is how it works:

  1. Create metadata – Use whatever process you desire to create and edit your metadata. Much of what we suggest is outlined in a previous posting affectionately called “the recipe”.
  2. Export metadata – If your metadata is in MARC format, then query your integrated library system for all things destined for the Portal, and save the result to a single file using the UTF-8 character set. If your metadata is in EAD format, then export it as individual files making sure they are well-formed and valid.
  3. Expose metadata – In either case, MARC records or EAD files, the next step is to save the metadata on a Web server. Create or have created a directory on a Web server. Put the file of MARC records and/or the EAD files in the directory. There is no need to create a Web page. Just make sure the directory’s contents are listed automatically and by default. A good example is the work done by Marquette University.
  4. Share the URL(s) – Once the files are on a Web server, they will have URLs. In the case of MARC records, send Notre Dame the URL of the MARC file. In the case of EAD files, send the URL of the directory.
  5. Repeat – This is a never-ending process. Go to Step #1. As you create, edit, and export new or different metadata, save it in the Web-accessible directory. There is no need to send the updates to Notre Dame. They will be harvested on a regular basis. There is no need to denote which records are new, changed, or deleted. Previously indexed records will be discarded and the whole lot will be re-indexed.
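
Before copying files into the Web-accessible directory, it can help to run a quick local check against the requirements in Step #2. The following is only a minimal sketch in Python, not part of the Portal's own tooling. It assumes EAD files end in ".xml" and the MARC file ends in ".mrc" (both extensions are assumptions, as are the function names). It tests whether each EAD file is well-formed, not whether it is valid against the EAD DTD or schema, and it tests only whether the MARC file's bytes decode as UTF-8, which is a rough indicator rather than a guarantee.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def is_well_formed(xml_path):
    """Return True if the file parses as XML (well-formed, not necessarily valid)."""
    try:
        ET.parse(str(xml_path))
        return True
    except ET.ParseError:
        return False


def is_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def check_directory(directory):
    """Map each problem file name to a short description of the problem."""
    problems = {}
    for path in sorted(Path(directory).iterdir()):
        suffix = path.suffix.lower()
        # Assumption: EAD finding aids are saved as .xml files.
        if suffix == ".xml" and not is_well_formed(path):
            problems[path.name] = "not well-formed XML"
        # Assumption: the exported MARC records are saved as a .mrc file.
        elif suffix == ".mrc" and not is_utf8(path):
            problems[path.name] = "not UTF-8"
    return problems
```

Calling check_directory() with the path of the directory you are about to expose returns a dictionary of the files that failed. An empty dictionary means nothing was flagged. Checking validity as well would require a validating parser, which the Python standard library does not provide.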

There are many benefits to this process. First, the data gets duplicated. “Lots of copies keep stuff safe.” Second, Internet spiders and robots will find your data, index it, and make it accessible via their indexes. That is a good thing. Third, it gives you more control over the data and reduces the risk of Notre Dame losing it.

Just like the previous “recipe”, what is described above is only an outline. Each institution will differ slightly in its implementation. If you have any questions, then please don’t hesitate to ask.

Author: Eric Lease Morgan

I am a librarian first and a computer user second. My professional goal is to discover new ways to use computers to provide better library services. I use much of my time here at the University of Notre Dame developing and providing technical support for the Catholic Research Resources Alliance -- the "Catholic Portal".