Fulltext indexing in Vufind with Aperture

The implementation of fulltext indexing in Vufind with Aperture is not difficult. This posting describes how I implemented it for the Catholic Research Resources Alliance.

About 800 of the 125,000 indexed records in the “Catholic Portal” are linked to full text through a URL in the MARC records’ 856 field. The vast majority of these records come from the University of Toronto and the University of Notre Dame. The process of fulltext indexing is documented at vufind.org, but I’ll clarify here.

The first step is to download and install Aperture on the same file system as your Vufind implementation. I downloaded version 1.5, which seems to work just fine. From what I saw, no configuration is necessary.

Second, verify that Aperture works by running it from the command line. In other words, change to the Aperture directory and run bin/webcrawler.sh with the URL of a PDF document as its argument, something like this:

bin/webcrawler.sh http://zoia.library.nd.edu/carmedemessire02from.pdf

The result should be a tiny report listing how much time and effort was spent by the crawler. At first my installation did not work because I was not using a fully baked version of Java. After identifying the location of a full-blown version of Java, I hard-coded its full path in webcrawler.sh.
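If you are not sure which Java to point webcrawler.sh at, a small helper along these lines can pick the first usable candidate. This is only a sketch: the function name first_executable and the example paths in the comments are my own assumptions for illustration, not anything shipped with Aperture.

```shell
#!/bin/sh
# Print the first argument that is an executable regular file.
# Return 1 if none of the candidates qualifies.
# The candidate paths are supplied by the caller; none are Aperture defaults.
first_executable() {
    for candidate in "$@"; do
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Example usage (hypothetical locations; adjust for your system):
#   JAVA=$(first_executable /usr/lib/jvm/default/bin/java /usr/bin/java) || exit 1
#   "$JAVA" -version
```

Whatever path comes back is the one to hard-code in webcrawler.sh.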

Working backwards, I discovered that webcrawler.sh gets executed through a Bean Shell script called import/index_scripts/getFulltext.bsh. Unfortunately for me, Vufind guessed the location of my fulltext.ini file incorrectly. Consequently, I hard-coded the value of fulltextIniFile on or around line 94 in getFulltext.bsh.

Next I edited web/conf/fulltext.ini. Specifically, I uncommented one of the webcrawler assignments and edited it to point to the full path of webcrawler.sh.

The last configuration was the easiest. In import/marc_local.properties I uncommented the value for fulltext.
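Before re-indexing, it can be worth confirming that both settings really are uncommented. The following is a minimal sketch that assumes each setting sits on its own line in "key = value" form and that a commented-out line starts with some other character; the function name has_active_setting is hypothetical.

```shell
#!/bin/sh
# Succeed if FILE contains a line that begins (after optional whitespace)
# with KEY followed by "=". Commented-out lines do not match because they
# begin with a comment character instead of the key.
# Assumes a simple "key = value" layout, one setting per line.
has_active_setting() {
    file=$1
    key=$2
    grep -E -q "^[[:space:]]*${key}[[:space:]]*=" "$file"
}

# Example usage, run from the Vufind root directory:
#   has_active_setting web/conf/fulltext.ini webcrawler || echo "webcrawler is not set"
#   has_active_setting import/marc_local.properties fulltext || echo "fulltext is not set"
```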

Once all this work was done, I was able to index normally. When the indexer encounters a URL ending in pdf in MARC field 856$u, it calls webcrawler.sh. Webcrawler.sh harvests the remote PDF document, extracts the plain text, and returns it to the indexer. The indexer then saves the plain text to the index for searching. Obviously, this extra step increases indexing time considerably.
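The selection rule is easy to try out on its own. The sketch below is not Vufind's code (the real logic lives in getFulltext.bsh); it only mimics the "URL ending in pdf" test in shell so you can see which of your 856$u values would trigger a harvest. The function names are hypothetical, and the match is case-sensitive, which is my assumption.

```shell
#!/bin/sh
# Succeed when the URL ends in "pdf" (lowercase only; an assumption).
is_pdf_url() {
    case "$1" in
        *pdf) return 0 ;;
        *) return 1 ;;
    esac
}

# Read one URL per line on standard input and print only the URLs
# that would be handed to the crawler.
pdf_urls() {
    while IFS= read -r url; do
        if is_pdf_url "$url"; then
            printf '%s\n' "$url"
        fi
    done
}

# Example usage, given a file of URLs extracted from 856$u (hypothetical file name):
#   pdf_urls < urls.txt | while IFS= read -r u; do bin/webcrawler.sh "$u"; done
```

Running the crawler by hand over such a list also gives a rough idea of how much time the extra step will add to indexing.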

Many of the PDF documents in the “Portal” are in French. Consequently, I am able to search the Portal for Saint-Christôt and find Tong-King et martyr, ou vie du vénérable Jean-Louis Bonnard, missionnaire au Tong-King, décapité pour la foi le 1er mai 1852.

I think a greater number of library “catalogs” and “discovery systems” ought to support full text indexing. Vufind supports this functionality without too much difficulty.

Author: Eric Lease Morgan

I am a librarian first and a computer user second. My professional goal is to discover new ways to use computers to provide better library services. I use much of my time here at the University of Notre Dame developing and providing technical support for the Catholic Research Resources Alliance -- the "Catholic Portal".