Limit to full text in VuFind

This posting outlines how a “limit to full text” functionality was implemented in the “Catholic Portal’s” version of VuFind.

While there are many dimensions of the Catholic Portal, one of its primary components is a sort of union catalog of rare and infrequently held materials of a Catholic nature. This union catalog is comprised of metadata from MARC records, EAD files, and OAI-PMH data repositories. Some of the MARC records include URLs in 856$u fields. These URLs point to PDF files that have been processed with OCR. The Portal’s indexer has been configured to harvest the PDF documents when it comes across them. Once a PDF file is harvested, the OCR text is extracted from it, and the resulting text is added to the underlying Solr index. The values of the URLs are saved to the Solr index as well. Almost by definition, all of the OAI-PMH content indexed by the Portal is full text; almost all of the OAI-PMH content includes pointers to images or PDF documents.

Consequently, if a reader wanted to find only full text content, then it would be nice to: 1) do a search, and 2) limit to full text. And this is exactly what was implemented. The first step was to edit Solr’s definition of the url field. Specifically, its “indexed” attribute was changed from false to true. Trivial. Solr was then restarted.
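The schema edit can be sketched as a one-line filter. This assumes the url field is declared on a single line of Solr's schema.xml with indexed="false", as in the sample shown in the comment; the function name enable_url_indexing is hypothetical, and editing the file by hand works just as well.

```sh
# Assumed shape of the field definition before the edit:
#   <field name="url" type="string" indexed="false" stored="true" multiValued="true"/>

# Read a schema on stdin and write it to stdout, flipping indexed="false"
# to indexed="true" on the url field only; other fields are left alone.
enable_url_indexing() {
  sed '/<field name="url"/ s/indexed="false"/indexed="true"/'
}

# Usage (file names are placeholders):
#   enable_url_indexing < schema.xml > schema.xml.new
```

Writing to a new file rather than editing in place leaves the original schema available for comparison before Solr is restarted.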

The second step was to re-index the MARC content. When this is complete, the reader is able to search the index for URL content — “url:*”. In other words, find all records whose URL equals anything.

The third step was to understand that all of the local VuFind OAI-PMH identifiers have the same shape. Specifically, they all include the string “oai”. Consequently, the very astute reader could find all OAI-PMH content with the following query: “id:*oai*”.
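Both queries can be tried directly against Solr before touching VuFind. The following is a minimal sketch, assuming the Solr core answers at http://localhost:8080/solr/biblio (an assumption; adjust for the local installation). The names solr_count, SOLR_URL, and CURL are made up for illustration.

```sh
# Assumption: location of the Solr core behind VuFind.
SOLR_URL="${SOLR_URL:-http://localhost:8080/solr/biblio}"
# The HTTP client can be swapped out, which is handy for dry runs.
CURL="${CURL:-curl}"

# Ask Solr how many records match a query. With rows=0 no documents are
# returned, only the response header and the total (numFound).
solr_count() {
  "$CURL" -s -G "$SOLR_URL/select" \
    --data-urlencode "q=$1" \
    --data-urlencode 'rows=0' \
    --data-urlencode 'wt=json'
}

# Usage (requires a running Solr):
#   solr_count 'url:*'      # records that have any URL at all
#   solr_count 'id:*oai*'   # records whose identifier contains "oai"
```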

The fourth step was to turn on a VuFind checkbox option found in facets.ini. Specifically, the “[CheckboxFacets]” section was augmented to include the following line:

id:*oai* OR url:* = "Limit to full text"

When this was done a new facet appeared in the VuFind interface.

Finally, the whole thing comes to fruition when a person does an initial search. The results are displayed, and the facets include a limit option. Upon selection, VuFind searches again, but limits the query by “id:*oai* OR url:*” — only items that have URLs or come from OAI-PMH repositories. Pretty cool.
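What happens after the checkbox is selected can be approximated outside of VuFind with a Solr filter query (the fq parameter), which narrows a result set without changing the reader's original query. This is only a rough sketch: the Solr URL is an assumption, the function and variable names are hypothetical, and VuFind itself adds many other parameters to its requests.

```sh
# Assumption: location of the Solr core behind VuFind.
SOLR_URL="${SOLR_URL:-http://localhost:8080/solr/biblio}"
CURL="${CURL:-curl}"

# The same expression that was added to the [CheckboxFacets] section.
FULL_TEXT_FILTER='id:*oai* OR url:*'

# Run the reader's query, limited to records with URLs or OAI identifiers.
search_full_text() {
  "$CURL" -s -G "$SOLR_URL/select" \
    --data-urlencode "q=$1" \
    --data-urlencode "fq=$FULL_TEXT_FILTER" \
    --data-urlencode 'wt=json'
}

# Usage (requires a running Solr):
#   search_full_text 'papal infallibility'
```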

Kudos go to Demian Katz for outlining this process. Very nice. Thank you!

CRRA Update Spring 2016

(December, January, February)
Please see the PDF for the more visually rich version.

Feature Article: The Oliver Leonard Kapsner, O.S.B. Cataloging Bulletin: A Resource for Catalogers of Catholic Publications
From the Board
Committee Briefs

Newspapers Project
Tech Corner
Collection Highlights
News from Our Members
News from CRRA
For Your Consideration
Save the Date!

Catholic Pamphlets and the Catholic Portal: An evolution in librarianship

This blog posting outlines, describes, and demonstrates how a set of Catholic pamphlets were digitized, indexed, and made accessible through the Catholic Portal. In the end it advocates an evolution in librarianship.

A few years ago, a fledgling Catholic pamphlets digitization process was embarked upon. [1] In summary, a number of different library departments were brought together, a workflow was discussed, timelines were constructed, and in the end approximately one third of the collection was digitized. The MARC records pointing to the physical manifestations of the pamphlets were enhanced with URLs pointing to their digital surrogates and made accessible through the library catalog. [2] These records were also denoted as being destined for the Catholic Portal by adding a value of CRRA to a local note. Consequently, each of the Catholic Pamphlet records also made its way to the Portal. [3]

Because the pamphlets have been digitized, and because the digitized versions of the pamphlets can be transformed into plain text files using optical character recognition, it is possible to provide enhanced services against this collection, namely, text mining services. Text mining is a digital humanities application rooted in the counting and tabulation of words. By counting and tabulating the words (and phrases) in one or more texts, it is possible to “read” the texts and gain a quick & dirty understanding of their content. Probably the oldest form of text mining is the concordance, and each of the digitized pamphlets in the Portal is associated with a concordance interface.

For example, the reader can search the Portal for something like “is the pope always right”, and the result ought to return a pointer to a pamphlet named Is the Pope always right? of papal infallibility. [4] Upon closer examination, the reader can download a PDF version of the pamphlet as well as use a concordance against it. [5, 6] Through the use of the concordance the reader can see that the words church, bill, charlie, father, and catholic are the most frequently used, and by searching the concordance for the phrase “pope is”, the reader gets a single sentence fragment in the result, “…ctrine does not declare that the Pope is the subject of divine inspiration by wh…” And upon further investigation, the reader can see this phrase is used about 80% of the way through the pamphlet.
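The counting and tabulating behind a concordance can be illustrated with a few standard Unix tools. This is a sketch only and not the code behind the Portal's concordance interface; the function names word_freq and kwic are hypothetical, only ASCII letters are treated as word characters, and the phrase given to kwic is interpreted as a regular expression.

```sh
# Tabulate word frequencies from plain text on stdin, most frequent first.
# Output lines look like "12 church".
word_freq() {
  tr -cs 'A-Za-z' '\n' |    # one word per line; everything else is a separator
    tr 'A-Z' 'a-z' |        # fold to lower case
    grep -v '^$' |          # drop empty lines
    sort | uniq -c |        # count each distinct word
    awk '{ print $1, $2 }' |
    sort -k1,1nr -k2,2      # by count (descending), then alphabetically
}

# Keyword-in-context: show each occurrence of a phrase along with up to
# WIDTH characters on either side (default 40), ignoring case.
kwic() {
  width=${2:-40}
  grep -oiE ".{0,$width}$1.{0,$width}"
}

# Usage, assuming pamphlet.txt holds the OCR text of a pamphlet:
#   word_freq < pamphlet.txt | head -n 5
#   kwic 'pope is' < pamphlet.txt
```

Note that kwic only finds phrases that sit on a single line of the input, so text with hard line breaks may need to be unwrapped first.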

The process of digitizing library materials is very much like the workflows of medieval scriptoriums, and the process is well understood. Description and access to digital versions of original materials is well-accommodated by the exploitation of MARC records. The next step is for the profession to move beyond find & get and towards use & understand. Many people can find many things, with relative ease. The next step for librarianship is to provide services against the things readers find so they can more easily learn & comprehend. Save the time of the reader. The integration of the University of Notre Dame’s Hesburgh Libraries’ Catholic Pamphlets Collection into the Catholic Portal is one possible example of how this evolutionary process can be implemented.

Links

[1] digitization process – http://blogs.nd.edu/emorgan/2012/03/pamphlets/

[2] library catalog – http://bit.ly/sw1JH8

[3] Catholic Portal – http://bit.ly/cathholicpamphlets

[4] “Of Papal Infallibility” – http://www.catholicresearch.net/vufind/Record/undmarc_003078072

[5] PDF version – http://repository.library.nd.edu/view/45/743445.pdf

[6] concordance interface – https://concordance.library.nd.edu/app/concordance/?id=743445

CRRA Update Winter 2016

(December, January, February)
Please see the PDF for the more visually rich version.

Feature Article: Interview with Michael Skaggs
From the Board
Committee Briefs

Newspapers Project

Tech Corner

Collection Highlights

News from Our Members
News From CRRA

For Your Consideration

Save the Date!

The Jesuit Libraries Provenance Project

Kyle Roberts, Loyola University Chicago

The Jesuit Libraries Provenance Project (JLPP) was launched in March 2014 to create a visual archive of provenance marks from historic Jesuit college, seminary, and university library collections and to foster a participatory community interested in the history of these books.

Nineteenth-century Jesuits never met a book that they didn’t like to stamp their name on. This stamp is found on books from Loyola’s original library collection (c.1870).

Founded by students, faculty, and library professionals at Loyola University Chicago, the Provenance Project is an outgrowth of an earlier project [http://blogs.lib.luc.edu/archives/] to reconstruct the holdings listed in Loyola’s original (c.1878) library catalog in an innovative virtual library system. That project, which was the subject of a graduate seminar at Loyola in Fall 2013 and will launch later this year, brought together graduate students in Digital Humanities, History, and Public History to recreate the nineteenth-century library catalog in a twenty-first century open source Integrated Library System (ILS). In the course of researching the approximately 5100 titles listed in the original catalog, students discovered that upwards of 1750 might still be held in the collections of Loyola’s Cudahy Library, the Library Storage Facility, and University Archives and Special Collections. A handful of undergraduate and graduate students formed the Provenance Project the following semester to see how many of these books actually survived. As they pulled books off the shelves and opened them up, they discovered a range of provenance marks – bookplates, inscriptions, stamps, shelf-marks, and other notations – littering the inside covers, flyleaves, and title pages of these books. Students soon realized that if the original library catalog could tell them what books the Jesuits collected, provenance marks could reveal from where the books came.

 

The inside covers of books collected by Jesuits can have bookplates, stamps, and sometimes surprising marginalia. From The Spirit of Popery (n.d.)

By utilizing the freely accessible online social media image-sharing platform Flickr, the Provenance Project seeks to create a participatory community of students, bibliographers, academics, private collectors, alumni, and others interested in the origin and history of Jesuit-collected books. A photostream within the Provenance Project Flickr site allows visitors to scroll through all of the pictures that have been uploaded while commenting and tagging functions provide the opportunity to share their own knowledge about specific images. For example, visitors can contribute transcriptions of inscriptions (especially ones written in messy or illegible hands), translations of words and passages in foreign languages, and identifications of former individual and institution owners. Not only does the Flickr site provide a visual index of the rich variety of works held by a late nineteenth-century Jesuit college library, but it also inspires reflection and scholarship on the importance of print to Catholic intellectual, literary, and spiritual life.

The Provenance Project also encourages undergraduate and graduate students to undertake mentored primary-source research on the history of individual books as well as broader themes in Catholic and book history. Their findings are shared with the public in a variety of ways. One of the rooms in the Summer 2014 exhibition, Crossings and Dwellings: Restored Jesuits, Women Religious, American Experience 1814-2014 at the Loyola University Museum of Art (LUMA) featured original library books selected by graduate students and accompanied by interpretative labels they wrote. Student interns regularly contribute original scholarship to the Provenance Project’s website as well as to the June 2015 issue of the Catholic Library World on the “Digital Future of Jesuit Studies.” [Citation: “The Digital Future of Jesuit Studies,” Catholic Library World 85:4 (June 2015): 240-259.] They have also given talks on their research at conferences, such as the annual meeting of the American Catholic Historical Association. The 2014 commemoration of the bicentennial of the restoration of the Society of Jesus has brought renewed scholarly attention to nineteenth-century Jesuits. The work of Provenance Project interns is actively contributing to that resurgence of interest.

The Flickr photostream for the Jesuit Libraries Provenance Project.

As of February 2016, students have tracked down all of the surviving books from the list of 1750 titles and are in the process of discerning how many of these titles are actual matches for those in the original catalog. (The answer appears to be the vast majority, making for a much higher survival rate than initially expected.) The team recently posted its 5000th image to the Flickr archive and still has many more images to upload over the coming months. Images on Flickr have also been usefully organized into albums either by nature of provenance mark (stamp, bookplate), part of book (illustrations, endpapers, binding), or division of the catalog (Pantology, Theology, Legislation, Philosophy, History, Literature). For those who would like to contribute to the Project, there are still many passages in need of translation and ownership marks in need of identification (helpfully gathered into the albums “Unidentified Inscriptions”, “Unidentified Stamps”, “Unidentified Embossed Stamps”, and “Unidentified Bookplates”).

Please follow the JLPP on Flickr (@JLPProject), Facebook, and Twitter (@JesuitProject). We try to post new books every day and scholarship on the blog every week or so during the semester, so check back often!

A final note: the Provenance Project is beginning conversations about expanding the site to include provenance images from the collections of other historic Jesuit college, seminary, and university libraries. If you are interested in learning more about participating, or want information about how to start a project for your own institution, don’t hesitate to contact Kyle Roberts.

Interview with Michael Skaggs

Michael Skaggs is a doctoral candidate in the Department of History at the University of Notre Dame. He studies religion in the American Midwest, and is particularly interested in how interfaith organizations addressed social problems.

What is your current area of research?

Right now, I’m working on a dissertation chapter on Catholic racial activism in 1960s Cincinnati. Partly those men and women did so because they got involved in the contemporary civil rights movement, but the Second Vatican Council’s call for the laity to be active in society had something to do with it, too. But I think the blend of those two motivations is more complicated than it seems at first.

More generally, my dissertation asks how Catholics in one midwestern place – Cincinnati, Ohio – responded to the Second Vatican Council, and how the presence of a substantial Jewish community inflected that response. This presents us with a fascinating opportunity to understand the real richness of American Catholicism, which I think we miss out on if we overlook places like Cincinnati, which usually don’t seem to be all that important to us.

Graduate students in search of dissertation topics are well-positioned to draw attention to topics and places long untouched by scholars! And while there are many scholars across the career timeline ready to embrace digitization, I think the younger generation has a natural ability to work with these resources – maybe even an impatience to do things “the old way.” This is a transitional moment in academia, though, so there’s a real need for students and future scholars to straddle the line between technologies old and new.

How do you use CRRA’s resources for your research?

I first came to know about CRRA just after I had finished a research project on The Criterion, the Archdiocese of Indianapolis’s official newspaper. I did it the new-old-fashioned way: cranking through what felt like miles of microfilm. Now I work with CRRA’s newspaper digitization project and have been excited at the conversations surrounding getting these sources into a format that we can use quickly and easily.

CRRA has been particularly useful in considering how I might shape my research projects to benefit from digitization in the future. Since most of my sources are not yet digitized, it’s been wonderful to look ahead and consider what might reasonably be digitized in the future and the scholarly community that will arise around those sources. It’s exciting to think about being part of a conversation that more and more people enter as sources open up to easy access from afar.

What is the most exciting / surprising source you’ve been able to get access to for your research?

I have to point to the old-school method of research for this one, too, because Cincinnati doesn’t get the attention it really ought to – a lot of Catholic scholarship has been focused to this point on “more important” places in the American Church. So the biggest and most impressive collections that CRRA catalogs come from elsewhere – an imbalance that CRRA is sure to fix in coming years and as more and more diocesan archives get involved. But I would say the most exciting – or one of the most exciting sources – has been The American Israelite, which was published by and for American Reform Jews. It’s fully digitized but only accessible in certain locations – a prime opportunity for CRRA, since the Israelite reported on things Catholic quite often!

The American Israelite points up the potential of partnerships between the academy and religious institutions. While no small project, that one newspaper presented a relatively straightforward digitization task. And many organizations would only be too happy to let CRRA digitize their materials if the funding is available and it can be done in a reasonable amount of time. Furthermore, CRRA has utilized an excellent strategy of asking scholars themselves what they need access to, as this provides a clear (if not concise) idea of sources that might be targeted for digitization. Most scholars with particular research projects can identify exactly which collections it would be useful to digitize, which makes the process manageable, even if not all that easy. From there, related collections can be identified for future scholars who aren’t working with them just yet, or individual archives can propose collections that really ought to be made digital, and so on. It pretty plainly represents the future and I’m happy to know CRRA is working hard to get ahead of the game.

What do you wish you could have access to but is currently unavailable?

I sound like a broken record whenever I’m asked this in CRRA conversations: fully digitized diocesan newspapers from across the United States. I think that would open up research fields historians have not even begun to consider, especially since having all of that information readily available would really help us uncover the complexity of American Catholicism from place to place. Many dioceses have the entire run of their newspaper preserved, in some cases very well so. A program just for diocesan archives – and especially their newspapers – would be a fantastic way to bring these sources into the mainstream of academic research, especially those smaller or “less important” dioceses that historians haven’t thought of yet.

I also think that archival sources on parish histories are a goldmine yet to be tapped by most scholars. The problem here is accessibility, or even knowing where they are kept: more than once I have run into a parish saying their materials are held at the diocesan archives, while the diocesan archivist says the materials are at the parish! And in many cases people just haven’t saved much. But if we really want to know about American Catholicism, we desperately need access to the sources pertinent to the vast majority of American Catholics: the laity, who connect to the Church first at the parish level. These materials don’t need to be all that in-depth to provide something useful, either – I’d be perfectly happy with a solid set of parish bulletins over a given period of time, for example, for what it would tell us about parish life. Again, this is where CRRA is in a great place to help, through utilizing scholars’ needs and wants to identify, catalog, and digitize collections.

OAI and VuFind: Notes to self in the form of a recipe

The primary purpose of this posting is to document some of my experiences with OAI and VuFind. Specifically, it outlines a sort of “recipe” I use to import OAI content into the “Catholic Portal”. The recipe is followed by a set of “ingredients”, site-specific commands. Towards the end, I ruminate on the use of OAI and Dublin Core for the sharing of metadata.


Recipe

When I learn of a new OAI repository containing metadata destined for the Portal, I use the following recipe to complete the harvesting/indexing process:

  1. Use the OAI protocol directly to browse the remote data repository – This requires a slightly in-depth understanding of how OAI-PMH functions, and describing it in any additional detail is beyond the scope of this posting. Please consider perusing the OAI specification itself.
  2. Create a list of sets to harvest – This is like making a roux and is used to configure the oai.ini file, next.
  3. Edit/configure harvesting via oai.ini and properties files – The VuFind oai.ini file denotes the repositories to harvest from as well as some pretty cool configuration directives governing the harvesting process. Whoever wrote the harvester for VuFind did a very good job. Kudos!
  4. Harvest a set – The command for this step is in the list of ingredients, below. Again, this is very well written.
  5. Edit/configure indexing via an XSL file – This is the most difficult part of the process. It requires me to write XSL, which is not too difficult in and of itself, but since each set of OAI content is often different from every other set, the XSL is set specific. Moreover, the metadata of the set is often incomplete, inconsistent, or ambiguous making the indexing process a challenge. In another post, it would behoove me to include a list of XSL routines I seem to use from repository to repository, but again, each repository is different.
  6. Test XSL output for completeness – The command for this step is below.
  7. Go to Step #5 until done – In this case “done” is usually defined as “good enough”.
  8. Index set – Our raison d’être, and the command is given below.
  9. Go to Step #4 for all sets – Each repository may include many sets, which is a cool OAI feature.
  10. Harvest and index all sets – Enhance the Portal.
  11. Go to Step #10 on a regular basis – OAI content is expected to evolve over time.
  12. Go to Step #1 on a less regular basis – Not only does content change, but the way it is described evolves as well. Harvesting and indexing is a never-ending process.
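Steps #1 and #2 amount to sending a handful of OAI-PMH requests and reading the XML that comes back. The sketch below builds the request URLs from the protocol's standard verbs and arguments. The helper name oai_url, the base URL http://example.org/oai, and the set name coll25 are placeholders; argument values are not URL-encoded, so this only suits simple values.

```sh
# Build an OAI-PMH request URL from a base URL, a verb, and any number of
# key=value arguments.
oai_url() {
  base=$1
  verb=$2
  shift 2
  url="$base?verb=$verb"
  for arg in "$@"; do
    url="$url&$arg"
  done
  printf '%s\n' "$url"
}

# Usage, against a hypothetical repository:
#   curl -s "$(oai_url http://example.org/oai Identify)"
#   curl -s "$(oai_url http://example.org/oai ListSets)"
#   curl -s "$(oai_url http://example.org/oai ListRecords metadataPrefix=oai_dc set=coll25)"
```

The setSpec values returned by ListSets are what make up the list of sets to harvest in Step #2.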

Ingredients

I use the following Linux “ingredients” to help me through the process of harvesting and indexing. I initialize things with a couple of environment variables. I use full path names whenever possible because I don’t know where I will be in the file system, and the VUFIND_HOME environment variable sometimes gets in the way. Ironic.

  # configure; first the name of the repository and then a sample metadata file
  NAME=luc
  FILE=1455898167_lucoai_coll25_55.xml

  # (re-)initialize
  rm -rf /usr/local/vufind2/local/harvest/$NAME/*.delete
  rm -rf /usr/local/vufind2/local/harvest/$NAME/*

  # delete; an unfinished homemade Perl script to remove content from Solr
  /usr/local/vufind2/crra/crra-scripts/bin/solr-delete.pl

  # harvest; do the first part of the work
  cd /usr/local/vufind2/harvest/; php harvest_oai.php $NAME

  # test XSL output
  clear; \
  cd /usr/local/vufind2/import; \
  php ./import-xsl.php --test-only \
  /usr/local/vufind2/local/harvest/$NAME/$FILE \
  $NAME.properties

  # index; do the second part of the work
  /usr/local/vufind2/harvest/batch-import-xsl.sh $NAME $NAME.properties

Using the recipe and these ingredients, I am usually able to harvest and index content from a new repository in a few hours. Of course, it all depends on the number of sets in the repository, the number of items in each set, as well as the integrity of the metadata itself.
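Steps #10 and #11 of the recipe (harvest and index everything, on a regular basis) lend themselves to a small wrapper around the same two commands. The following is a sketch under assumptions: the paths are the ones used above, the helper names run and harvest_and_index as well as the DRY_RUN switch are made up for illustration, and each repository is assumed to have a matching section in oai.ini and a NAME.properties file.

```sh
VUFIND=/usr/local/vufind2

# Run a command, or only print it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Harvest and then index each repository named on the command line,
# stopping at the first failure.
harvest_and_index() {
  for name in "$@"; do
    # the subshell keeps the change of directory local to the harvest
    ( run cd "$VUFIND/harvest" && run php harvest_oai.php "$name" ) || return 1
    run "$VUFIND/harvest/batch-import-xsl.sh" "$name" "$name.properties" || return 1
  done
}

# Usage:
#   DRY_RUN=1 harvest_and_index luc   # print the commands without running them
#   harvest_and_index luc             # do the work
```

Something like this could be scheduled with cron once the XSL for each repository is deemed good enough.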

Ruminations

As I have alluded to in a previous blog posting, the harvesting and indexing of OAI content is not straight-forward. In my particular case, the software is not to blame. No, the software is very well-written. I don’t take advantage of all of the software’s features though, but that is only because I do not desire to introduce any “-isms” into my local implementation. Specifically, I do not desire to mix PHP code with my XSL routines. Doing so seems too much like Fusion cuisine.

The challenge in this process is both the way Dublin Core is used, as well as the data itself. For example, is a PDF document a type of text? Sometimes it is denoted that way. There are dates in the metadata, but the dates are not qualified. Date published? Date created? Date updated? Moreover, the dates are syntactically different: 1995, 1995-01-12, January 1995. My software is stupid and/or I don’t have the time to normalize everything for each and every set. Then there are subjects. Sometimes they are Library of Congress headings. Sometimes they are just keywords. Sometimes there are multiple subjects in the metadata and they are enumerated in one field delimited by various characters. Sometimes these multiple subject “headings” are manifested as multiple dc.subject elements. Authors (creators) present a problem. First name last? Last name first? Complete with birth and death dates? Identifiers? Ack! Sometimes they include unique codes, things akin to URIs. Cool! Sometimes identifiers are URLs, but most of the time, these URLs point to splash pages of content management systems. Rarely do the identifiers point to the item actually described by the metadata. And then there are out & out errors. For example, description elements containing URLs pointing to image files.
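To give a flavor of the normalization involved, here are two tiny helpers, written in shell for brevity rather than in the XSL used for indexing. They are illustrations only: the function names are hypothetical, the first simply takes the first run of four digits as the year, and the second assumes subjects are delimited by semicolons or vertical bars.

```sh
# Reduce a free-form date (1995, 1995-01-12, January 1995) to a year by
# taking the first run of four digits; prints nothing if there is none.
extract_year() {
  printf '%s\n' "$1" | grep -oE '[0-9]{4}' | head -n 1
}

# Split a delimited subject field into one subject per line, trimming
# the whitespace around each and dropping empty entries.
split_subjects() {
  printf '%s\n' "$1" |
    tr ';|' '\n\n' |
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
    grep -v '^$'
}

# Usage:
#   extract_year 'January 1995'
#   split_subjects 'Jesuits; Libraries | Chicago'
```

Neither helper resolves the deeper ambiguity, of course: a year extracted this way still says nothing about whether it is a date of publication, creation, or update.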

Actually, none of this is new. Diane Hillmann & friends encountered all of these problems on a much grander scale through the National Science Foundation’s desire to create a “digital library”. Diane’s entire blog, Metadata Matters, is a cookbook for resolving these issues, but in my way of boiling everything down to its essentials, the solution is two-fold: 1) mutual agreements on how to manifest metadata, and 2) the writing of more intelligent software on my part.

CRRA Update Fall 2015

CRRA Update
Fall 2015
(September, October, November)
please see the PDF for
the more visually rich version

Feature Article: Interview with Jim McCartin

From the Board

Committee Briefs

Tech Corner

Collection Highlights

News from Our Members

News from CRRA

Save the Date!

Interview with Jim McCartin, Fordham University

Jim McCartin, Associate Professor of Theology and Director of the Center on Religion and Culture at Fordham University, joins us for a discussion of his research and the role CRRA has played in shaping and abetting his scholarly work. His book, Prayers of the Faithful: The Shifting Spiritual Life of American Catholics, came out in 2010 and explores prayer in the lives of American Catholics from the 1860s to the 1980s. His current project is the book: American Catholics and Sex from the 1830s to the 1980s.

What is your current area of research?

I’m currently working on a book project on the history of US Catholics and sex from the 1830s to the 1980s. The study begins with early nineteenth-century European Catholic immigrants and the anxieties they provoked among non-Catholics concerned that Catholics were sexual deviants because of their practice of vowed celibacy, and it ends with the emerging story of clerical sex abuse in the late twentieth century. In between, it turns out that the story of US Catholics and sex is a great deal more interesting and complicated than historians and others have normally assumed, which makes this project especially exciting.

How did you get interested in your research area?

Well, after the clerical sex abuse scandal exploded in 2002, it occurred to me that, while there is a lot of published work out there on the history of US sexuality, that work has not dealt at all adequately with how religion fits into the story of sex, and in particular, it hasn’t given very serious attention to Catholicism’s place in that story. I was looking for ways to think about how we get to the clerical sex abuse scandal of the early 2000s, and I found nothing that could provide an adequate, sensible narrative grounded in deep archival research. So, while my goal isn’t specifically to write a history of Catholicism and sex abuse, this project emerged out of a desire to offer a narrative that is sufficiently textured and grounded and one that can help to place sex abuse into a larger narrative frame.

How do you use the CRRA’s resources for your research? Which resources have been the most helpful, and why? How has Catholic Newspapers Online been useful?

CRRA has been extremely useful in helping me identify a whole array of published and archival sources for this project. There’s no better way to be able to survey the published materials on Catholicism available in the United States, and I’ve made probably a dozen archival trips based on materials I’ve identified through the Portal. I have to say that I’ve been especially grateful for the digitized newspapers, though, which have been a tremendous source as I try to get a sense for how family life and related questions of sexuality played out on the ground in various local settings.

What’s the most exciting/surprising source you’ve been able to get access to for your research?

There’s a lot out there that has fascinated me. Among the most interesting sources I’ve come across are the trial records for an 1843 clerical rape trial that figures into the narrative I’m constructing. But there’s also just a wealth of interesting documentation on the practice of clerical celibacy in the 1880s and 1890s, on sex education in the 1920s, on Catholic arguments over the Rhythm Method in the 1930s, and on same-sex attraction in the 1940s and 1950s. It turns out that US Catholics had quite complicated and pretty well-informed conversations around these and other themes, conversations that are much more nuanced and interesting than they are normally given credit for.

What do you wish you could get access to but is currently unavailable?

Good question. I’m not exactly sure there are resources out there on this, but I’d love to have access to documents that provide a clearer sense of how sexuality was framed in the formation of male and female religious in the first half of the twentieth century. I’d also love to see archival materials related to the work of the Servants of the Paraclete, a religious order that, already in the early post-1945 era, began to care for priests involved in sexual relationships of one kind or another.

CRRA Update Summer 2015

CRRA Update
Summer 2015
(June, July, August)
please see the PDF for
the more visually rich version

In this issue: