Limit to full text in VuFind

This posting outlines how a “limit to full text” functionality was implemented in the “Catholic Portal’s” version of VuFind.

While there are many dimensions of the Catholic Portal, one of its primary components is a sort of union catalog of rare and infrequently held materials of a Catholic nature. This union catalog comprises metadata from MARC records, EAD files, and OAI-PMH data repositories. Some of the MARC records include URLs in 856$u fields. These URLs point to PDF files that have been processed with OCR. The Portal’s indexer has been configured to harvest the PDF documents when it comes across them. Once harvested, the OCR text is extracted from the PDF file, and the resulting text is added to the underlying Solr index. The values of the URLs are saved to the Solr index as well. Almost by definition, all of the OAI-PMH content indexed by the Portal is full text; almost all of the OAI-PMH content includes pointers to images or PDF documents.
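To make this concrete, such a link lives in subfield $u of an 856 field. The indicators and layout below are illustrative only; the URL is borrowed from one of the digitized pamphlets described in a posting further down this page:

  856 40 $u http://repository.library.nd.edu/view/45/743445.pdf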

Consequently, if a reader wanted to find only full text content, then it would be nice to: 1) do a search, and 2) limit the results to full text. And this is exactly what was implemented. The first step was to edit Solr’s definition of the url field. Specifically, its “indexed” attribute was changed from false to true. Trivial. Solr was then restarted.
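In VuFind’s bundled Solr, the change amounts to a one-attribute edit along the following lines. The file path, field type, and the other attributes shown here are assumptions; check your own schema before editing:

  # in $VUFIND_HOME/solr/biblio/conf/schema.xml (path is an assumption)
  # before: <field name="url" type="string" indexed="false" stored="true" multiValued="true"/>
  # after:  <field name="url" type="string" indexed="true"  stored="true" multiValued="true"/>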

The second step was to re-index the MARC content. Once the re-indexing was complete, the reader was able to search the index for URL content — “url:*”. In other words, find all records whose url field contains any value at all.

The third step was to understand that all of the local VuFind OAI-PMH identifiers have the same shape. Specifically, they all include the string “oai”. Consequently, the very astute reader could find all OAI-PMH content with the following query: “id:*oai*”.
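For the curious, both of these searches can be run directly against Solr itself with something like the following. The host, port, and core name are assumptions and will vary from installation to installation:

  # how many records have any value at all in the url field?
  curl 'http://localhost:8080/solr/biblio/select?q=url:*&rows=0'

  # how many records carry OAI-PMH-shaped identifiers?
  curl 'http://localhost:8080/solr/biblio/select?q=id:*oai*&rows=0'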

The fourth step was to turn on a VuFind checkbox option found in facets.ini. Specifically, the “[CheckboxFacets]” section was augmented to include the following line:

  id:*oai* OR url:* = "Limit to full text"

When this was done, a new facet appeared in the VuFind interface.

Finally, the whole thing comes to fruition when a person does an initial search. The results are displayed, and the facets include the new limit option. Upon selection, VuFind searches again, but limits the query with “id:*oai* OR url:*” — only items that have URLs or come from OAI-PMH repositories are returned. Pretty cool.

the Catholic Portal’s version of VuFind

Kudos go to Demian Katz for outlining this process. Very nice. Thank you!

Catholic Pamphlets and the Catholic Portal: An evolution in librarianship

This blog posting outlines, describes, and demonstrates how a set of Catholic pamphlets were digitized, indexed, and made accessible through the Catholic Portal. In the end it advocates an evolution in librarianship.

A few years ago, a fledgling process to digitize Catholic pamphlets was embarked upon. [1] In summary, a number of different library departments were brought together, a workflow was discussed, timelines were constructed, and in the end approximately one third of the collection was digitized. The MARC records pointing to the physical manifestations of the pamphlets were enhanced with URLs pointing to their digital surrogates and made accessible through the library catalog. [2] These records were also denoted as being destined for the Catholic Portal by adding a value of CRRA to a local note. Consequently, each of the Catholic Pamphlet records also made its way to the Portal. [3]

Because the pamphlets have been digitized, and because the digitized versions of the pamphlets can be transformed into plain text files using optical character recognition, it is possible to provide enhanced services against this collection, namely, text mining services. Text mining is a digital humanities application rooted in the counting and tabulation of words. By counting and tabulating the words (and phrases) in one or more texts, it is possible to “read” the texts and gain a quick & dirty understanding of their content. Probably the oldest form of text mining is the concordance, and each of the digitized pamphlets in the Portal is associated with a concordance interface.

For example, the reader can search the Portal for something like “is the pope always right”, and the result ought to return a pointer to a pamphlet named Is the Pope always right? of papal infallibility. [4] Upon closer examination, the reader can download a PDF version of the pamphlet as well as use a concordance against it. [5, 6] Through the use of the concordance the reader can see that the words church, bill, charlie, father, and catholic are the most frequently used, and by searching the concordance for the phrase “pope is”, the reader gets a single sentence fragment in the result, “…ctrine does not declare that the Pope is the subject of divine inspiration by wh…” And upon further investigation, the reader can see this phrase is used about 80% of the way through the pamphlet.
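Readers who want a feel for what the concordance is doing under the hood can approximate both services at the command line. The sketch below assumes a plain text version of a pamphlet has been saved as pamphlet.txt; the file name is hypothetical:

  # tabulate the most frequently used words, a la the concordance's frequency list
  tr -cs '[:alpha:]' '\n' < pamphlet.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn | head

  # display keyword-in-context fragments for the phrase "pope is"
  grep -io '.\{0,40\}pope is.\{0,40\}' pamphlet.txt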

The process of digitizing library materials is very much like the workflows of medieval scriptoriums, and the process is well understood. Description of, and access to, digital versions of original materials is well accommodated by the exploitation of MARC records. The next step for the profession is to move beyond find & get and towards use & understand. Many people can find many things with relative ease. The next step for librarianship is to provide services against the things readers find so they can more easily learn & comprehend. Save the time of the reader. The integration of the University of Notre Dame’s Hesburgh Libraries’ Catholic Pamphlets Collection into the Catholic Portal is one possible example of how this evolutionary process can be implemented.

Links

[1] digitization process – http://blogs.nd.edu/emorgan/2012/03/pamphlets/

[2] library catalog – http://bit.ly/sw1JH8

[3] Catholic Portal – http://bit.ly/cathholicpamphlets

[4] “Of Papal Infallibility” – http://www.catholicresearch.net/vufind/Record/undmarc_003078072

[5] PDF version – http://repository.library.nd.edu/view/45/743445.pdf

[6] concordance interface – https://concordance.library.nd.edu/app/concordance/?id=743445

OAI and VuFind: Notes to self in the form of a recipe

The primary purpose of this posting is to document some of my experiences with OAI and VuFind. Specifically, it outlines a sort of “recipe” I use to import OAI content into the “Catholic Portal”. The recipe includes a set of “ingredients”: site-specific commands. Towards the end, I ruminate on the use of OAI and Dublin Core for the sharing of metadata.


Recipe

When I learn of a new OAI repository containing metadata destined for the Portal, I use the following recipe to complete the harvesting/indexing process:

  1. Use the OAI protocol directly to browse the remote data repository – This requires a slightly in-depth understanding of how OAI-PMH functions, and describing it in any additional detail is beyond the scope of this posting. Please consider perusing the OAI specification itself. A couple of sample requests are sketched after this list.
  2. Create a list of sets to harvest – This is like making a roux and is used to configure the oai.ini file, next.
  3. Edit/configure harvesting via oai.ini and properties files – The VuFind oai.ini file denotes the repositories to harvest from as well as some pretty cool configuration directives governing the harvesting process. Whoever wrote the harvester for VuFind did a very good job. Kudos!
  4. Harvest a set – The command for this step is in the list of ingredients, below. Again, this part of the software is very well written.
  5. Edit/configure indexing via an XSL file – This is the most difficult part of the process. It requires me to write XSL, which is not too difficult in and of itself, but since each set of OAI content is often different from every other set, the XSL is set specific. Moreover, the metadata of the set is often incomplete, inconsistent, or ambiguous making the indexing process a challenge. In another post, it would behoove me to include a list of XSL routines I seem to use from repository to repository, but again, each repository is different.
  6. Test XSL output for completeness – The command for this step is below.
  7. Go to Step #5 until done – In this case “done” is usually defined as “good enough”.
  8. Index set – Our raison d’être, and the command is given below.
  9. Go to Step #4 for all sets – Each repository may include many sets, which is a cool OAI feature.
  10. Harvest and index all sets – Enhance the Portal.
  11. Go to Step #10 on a regular basis – OAI content is expected to evolve over time.
  12. Go to Step #1 on a less regular basis – Not only does content change, but the way it is described evolves as well. Harvesting and indexing is a never-ending process.
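As an illustration of Step #1, the raw OAI-PMH requests look something like the following. The base URL and set name are hypothetical; every repository advertises its own:

  # list the sets available in a repository
  curl 'http://repository.example.org/oai?verb=ListSets'

  # peek at a few records from a given set, expressed as unqualified Dublin Core
  curl 'http://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&set=coll25'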

Ingredients

I use the following Linux “ingredients” to help me through the process of harvesting and indexing. I initialize things with a couple of environment variables. I use full path names whenever possible because I don’t know where I will be in the file system, and the VUFIND_HOME environment variable sometimes gets in the way. Ironic.

  # configure; first the name of the repository and then a sample metadata file
  NAME=luc
  FILE=1455898167_lucoai_coll25_55.xml

  # (re-)initialize
  rm -rf /usr/local/vufind2/local/harvest/$NAME/*.delete
  rm -rf /usr/local/vufind2/local/harvest/$NAME/*

  # delete; an unfinished homemade Perl script to remove content from Solr
  /usr/local/vufind2/crra/crra-scripts/bin/solr-delete.pl

  # harvest; do the first part of the work
  cd /usr/local/vufind2/harvest/; php harvest_oai.php $NAME

  # test XSL output
  clear; \
  cd /usr/local/vufind2/import; \
  php ./import-xsl.php --test-only \
  /usr/local/vufind2/local/harvest/$NAME/$FILE \
  $NAME.properties

  # index; do the second part of the work
  /usr/local/vufind2/harvest/batch-import-xsl.sh $NAME $NAME.properties

Using the recipe and these ingredients, I am usually able to harvest and index content from a new repository in a few hours. Of course, it all depends on the number of sets in the repository, the number of items in each set, as well as the integrity of the metadata itself.

Ruminations

As I have alluded to in a previous blog posting, the harvesting and indexing of OAI content is not straightforward. In my particular case, the software is not to blame. No, the software is very well written. I don’t take advantage of all of the software’s features, but that is only because I do not desire to introduce any “-isms” into my local implementation. Specifically, I do not desire to mix PHP code with my XSL routines. Doing so seems too much like fusion cuisine.

The challenge in this process is both the way Dublin Core is used as well as the data itself. For example, is a PDF document a type of text? Sometimes it is denoted that way. There are dates in the metadata, but the dates are not qualified. Date published? Date created? Date updated? Moreover, the dates are syntactically different: 1995, 1995-01-12, January 1995. My software is stupid and/or I don’t have the time to normalize everything for each and every set. Then there are subjects. Sometimes they are Library of Congress headings. Sometimes they are just keywords. Sometimes there are multiple subjects in the metadata, and they are enumerated in one field delimited by various characters. Sometimes these multiple subject “headings” are manifested as multiple dc.subject elements. Authors (creators) present a problem. First name last? Last name first? Complete with birth and death dates? Identifiers? Ack! Sometimes they include unique codes — things akin to URIs. Cool! Sometimes identifiers are URLs, but most of the time these URLs point to splash pages of content management systems. Rarely do the identifiers point to the item actually described by the metadata. And then there are out & out errors. For example, description elements containing URLs pointing to image files.
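To give a flavor of the kind of normalization involved, the following one-liner — a sketch, not a solution — plucks a four-digit year out of whatever date string a repository happens to supply:

  # extract the first four-digit year from assorted, unqualified date strings
  for DATE in '1995' '1995-01-12' 'January 1995'; do
    echo "$DATE" | grep -oE '[0-9]{4}' | head -1
  done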

Actually, none of this is new. Diane Hillmann & friends encountered all of these problems on a much grander scale through the National Science Foundation’s desire to create a “digital library”. Diane’s entire blog — Metadata Matters — is a cookbook for resolving these issues, but in my way of boiling everything down to its essentials, the solution is two-fold: 1) mutual agreements on how to manifest metadata, and 2) the writing of more intelligent software on my part.

Indexing and displaying Encoded Archival Description files

This posting muses on how to index and display Encoded Archival Description (EAD) files in the “Catholic Portal” of the Catholic Research Resources Alliance.

The Catholic Portal is essentially an index of two types of metadata: 1) records describing individual and discrete items, and 2) records describing collections of individual items. For the most part, the former metadata records are MARC records describing books. The latter are EAD files describing the holdings of archives.

My experience with Archivists’ Toolkit

by Adam McGinn (July 17, 2012)

During the last two months, I evaluated Archivists’ Toolkit for use with the Catholic Portal project. Archivists’ Toolkit is a program suitable for recording and managing archival metadata. The program stores metadata in either a remote or local SQL database, and it also allows exporting to an XML file. The documentation for Archivists’ Toolkit is quite helpful, though it is fairly comprehensive, and it may be difficult to find instructions for any one specific task. I am writing this document in the hope that it will help potential future users of Archivists’ Toolkit here at Hesburgh Library.

Graphic design and the “Catholic Portal”

Graphic design is definitely not my forte, but I think I have finally wrangled it, along with the overall look & feel of the “Catholic Portal”.

“Skinning” Vufind is not terribly difficult. Using a sort of inheritance, the implementor creates a hierarchy of directories where Vufind will look for customized output views before falling back to the default theme that comes with the distribution. I was having one heck of a time getting the search results to display correctly. After looking for solutions in all the wrong places, I finally copied a version of the Blueprint theme’s results.tpl file to my local themes directory. After tweaking it a bit, and after refreshing Vufind’s cache and compile directories, things started to line up. As an extra bonus, things like Google, Internet Archive, and HathiTrust snippet views were also being displayed. The “book bag” feature now works as well. Whew!
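For what it is worth, the gist of that fix looked something like the following. The theme name and the paths are assumptions; they will differ from installation to installation and from version to version of Vufind:

  # copy the distribution's results template into the local theme
  cp /usr/local/vufind/web/interface/themes/blueprint/Search/results.tpl \
     /usr/local/vufind/web/interface/themes/crra/Search/results.tpl

  # empty the cache and compile directories so the change takes effect
  rm -rf /usr/local/vufind/web/interface/cache/* \
         /usr/local/vufind/web/interface/compile/*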

That said, quite a number of skills are required in order for Vufind to be implemented properly. The people who supply them include, but are not limited to, subject experts, systems administrators, computer programmers, usability technicians, graphic designers, metadata experts, administrators of people, public service personnel, etc.

sample screen dump of “Catholic Portal” search results

Fulltext indexing in Vufind with Aperture

The implementation of fulltext indexing in Vufind with Aperture is not difficult. This posting describes how I implemented it for the Catholic Research Resources Alliance.

About 800 of the 125,000 indexed records in the “Catholic Portal” are linked to full text through a URL in the MARC records’ 856 field. The vast majority of these records come from the University of Toronto and the University of Notre Dame. The process of fulltext indexing is documented at vufind.org, but I’ll clarify it here.