This text documents the technical support required to keep the “Catholic Portal” running smoothly. In a nutshell, it falls into three categories:
- assisting CRRA members in making their metadata available
- harvesting and indexing member metadata
- maintaining the Portal’s software
The “Catholic Portal” is a part of the Catholic Research Resources Alliance (CRRA). The purpose of the Alliance is “to provide enduring global access to Catholic research resources”. To this end the Portal is currently and primarily a metadata index: pointers to “rare, unique and uncommon research materials”. The Portal is implemented using a variety of (mostly) open source software including: Vufind, WordPress, Concrete5, Apache, Analog, ProFTPD, Filemaker, and a set of locally developed shell and Perl scripts. The balance of this text describes how these pieces of software are used to provide technical support for the Portal.
Technical support includes assisting Alliance members in making their metadata accessible via the Web, as well as providing a means for putting metadata on the Web for those members who are unable to do this at their local institution.
The Portal currently supports three types of metadata: MARC, EAD, and Dublin Core formatted in XML. Instead of emailing metadata to a centralized authority, members are expected to regularly extract metadata from their local systems and make it accessible via one or more URLs on the Web. This particular process is documented in a pair of blog postings: “Making your content available” and “How to make MARC and EAD metadata available in the ‘Catholic Portal’”.
Some Alliance members are not authorized to make content available via the Web. Some members do not want to make their content available via their local Web servers. In order to satisfy these members, the Alliance supports an FTP “dropbox” where members can put their metadata. This FTP site doubles as a website enabling all member content to be accessible via one or more URLs. This process is described in a blog posting called “How to make CRRA metadata available via the FTP ‘dropbox’”.
The dropbox is implemented on a computer named crradrop.library.nd.edu with software called ProFTPD. The software is configured in the file /usr/local/etc/proftpd.conf. Alliance member usernames and passwords are created with a rudimentary shell script at /usr/local/sbin/proftpd-user.sh. After username/password combinations are shared with members, the dropbox’s firewall configuration needs to be updated to allow the members’ IP addresses. This is done by editing /etc/sysconfig/iptables. To make the member metadata accessible via one or more URLs, the dropbox’s file system is also accessible via the Web using Apache, and Apache’s configuration file is located at /etc/httpd/conf/httpd.conf.
Harvesting & indexing metadata
The second level of technical support involves harvesting and indexing member metadata, thus making it available via the Portal. This is facilitated through a set of locally developed shell and Perl scripts.
The Portal resides on a computer named cportal.library.nd.edu. All of the shell and Perl scripts, as well as their supporting configuration files, are located in /shared/cportal_prod/data/crra/crra-scripts. The most important configuration file is called etc/libraries.db, and it is essentially a list of members, their addresses, and URLs pointing to their Web-accessible metadata. Because most of the Perl scripts reference this configuration file, it is very important.
The Portal currently supports three types of metadata: 1) MARC, 2) EAD, and 3) a flavor of Dublin Core/XML exported from PastPerfect systems. MARC metadata is ingested using the following scripts:
- bin/marc-harvest.pl – Using HTTP, this script mirrors files of remotely located MARC records.
- bin/marc-add-code.pl – Reads mirrored MARC records and updates each record’s 001 field so it is unique across the Portal’s index. It does this by simply prepending the value of the 001 field with a three-letter code denoting a CRRA member institution.
- bin/marc-index.pl – This script reads the updated MARC records and inserts them into the Portal’s index (Solr). It is helpful to restart the Portal (sudo /usr/sbin/clusvcadm -R cportalprod) before this script is run in order to reduce the possibility of timeout errors.
- bin/marc-build.sh – This is a brain-dead shell script used to run each of the scripts above in batch. Ideally, this script should be run as a cron job.
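The identifier step performed by marc-add-code.pl can be illustrated with a short Python sketch. This is not the Perl script itself: it assumes records have already been parsed into plain dictionaries keyed by MARC tag, and the three-letter codes (NDU, MQU) and the add_code function are hypothetical names used only for illustration.

```python
def add_code(record: dict, code: str) -> dict:
    """Return a copy of `record` whose 001 value is prefixed with `code`."""
    if len(code) != 3 or not code.isalpha():
        raise ValueError("expected a three-letter institution code")
    updated = dict(record)
    updated["001"] = code + record["001"]
    return updated


# Two members can export the same local control number...
notre_dame = {"001": "000123", "245": "Catholic pamphlets"}
marquette = {"001": "000123", "245": "Parish histories"}

# ...but the prefixed identifiers no longer collide in a shared index.
index = {}
for code, record in (("NDU", notre_dame), ("MQU", marquette)):
    prefixed = add_code(record, code)
    index[prefixed["001"]] = prefixed
```

Without the prefix, the second record would silently overwrite the first because both share the key 000123. A real implementation would read and write MARC records with a MARC library rather than dictionaries.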
The process of ingesting EAD files is similar, and it is supported with the following scripts:
- bin/ead-harvest.pl – This script mirrors all the .xml files in a given HTTP-accessible directory.
- bin/ead-validate.pl – This script makes sure the mirrored XML files are well-formed and validate against the EAD DTD or EAD schema.
- bin/ead-transform.pl – This script has two functions. First, akin to marc-add-code.pl, this script adds unique identifiers to each did-level element of the given EAD file. Second, the script transforms each of the EAD files into a browsable HTML file, and the resulting file is saved to a local Web-accessible directory. At the time of this writing, this script needs a great deal of rewriting since the transformation process does not perform very well against schema-based EAD files.
- bin/ead-index.pl – Looping through each of the validated and updated EAD files, this script parses out the metadata (title, author, subject terms, abstracts, scope notes, etc.), and saves the result to the index (Solr). Because some of the more useful EAD metadata does not map directly to the out-of-the-box indexing schema, this script takes advantage of the indexer’s dynamic fields to create EAD-specific entries.
- bin/ead-build.pl – Like marc-build.sh, this shell script is designed to run each of the EAD scripts in sequence. It is intended to be executed as a cron job.
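Two of the EAD steps above, checking well-formedness and adding identifiers to did elements, can be sketched in Python with only the standard library. This is an illustration under assumptions, not the Perl scripts: the identifier format (member code, file name, running number) is hypothetical, and the standard library cannot validate against the EAD DTD or schema, so only well-formedness is checked here. Element names are compared by local name so that DTD-based and namespaced, schema-based files are treated alike.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def is_well_formed(xml_text: str) -> bool:
    """Return True when the text parses as XML (no DTD/schema validation)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def add_did_ids(xml_text: str, code: str, file_id: str) -> str:
    """Give every did element an id built from a member code and a counter."""
    root = ET.fromstring(xml_text)
    count = 0
    for element in root.iter():
        if isinstance(element.tag, str) and local_name(element.tag) == "did":
            count += 1
            element.set("id", f"{code}{file_id}_{count:04d}")
    return ET.tostring(root, encoding="unicode")


SAMPLE = (
    "<ead><archdesc><did><unittitle>Papers</unittitle></did>"
    "<dsc><c01><did><unittitle>Letters</unittitle></did></c01></dsc>"
    "</archdesc></ead>"
)
updated = add_did_ids(SAMPLE, "NDU", "papers")
```

Full validation against the EAD DTD or schema would need a validating parser, which is outside what this sketch covers.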
Since only a single institution (Philadelphia Archdiocesan Historical Research Center) supports the Dublin Core/XML format, the harvest and index process is combined into a single script: bin/pastperfect-index.pl. This script reads a remote XML file via HTTP, maps the file’s Dublin Core elements to Portal-specific fields, and updates the Portal’s index. The XML output of the Research Center was designed to be amenable to other PastPerfect members, if there should ever be any. If additional institutions using PastPerfect do become CRRA members, then this script may not scale and will need to be tweaked to read a configuration file and/or rewritten to support the output of the additional member(s).
The Portal itself is made up of quite a bit of software, the most important being Vufind. The balance of this section describes each component in turn.
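The mapping step can be sketched as follows. This is a Python illustration, not the Perl script: the wrapper elements (records, record), the sample values, and the index field names in FIELD_MAP are all assumptions, since the actual PastPerfect export and the Portal-specific fields are not spelled out here. Only the Dublin Core namespace is standard.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical mapping from Dublin Core elements to index field names.
FIELD_MAP = {
    "identifier": "id",
    "title": "title",
    "creator": "author",
    "subject": "topic",
    "description": "description",
}


def map_record(record: ET.Element) -> dict:
    """Collect mapped Dublin Core values; unmapped elements are skipped."""
    doc = {}
    for child in record:
        if not child.tag.startswith(DC):
            continue
        field = FIELD_MAP.get(child.tag[len(DC):])
        text = (child.text or "").strip()
        if field and text:
            doc.setdefault(field, []).append(text)
    return doc


SAMPLE = """
<records xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record>
    <dc:identifier>pahrc-0001</dc:identifier>
    <dc:title>Parish register</dc:title>
    <dc:subject>Baptisms</dc:subject>
    <dc:subject>Marriages</dc:subject>
    <dc:rights>not mapped</dc:rights>
  </record>
</records>
"""
docs = [map_record(record) for record in ET.fromstring(SAMPLE).iter("record")]
```

Keeping the mapping in a table such as FIELD_MAP is one way the script could later be driven by a configuration file if more PastPerfect institutions join.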
Vufind is the “heart & soul” of the Portal. It is a great example of a LAMP (Linux, Apache, MySQL, PHP) stack system. Installing and completely configuring Vufind is beyond the scope of this document, but the following things need to be kept in mind:
- Linux – Vufind will run on other operating systems, but it is really designed to run on Linux. A more-than-basic understanding of Linux is needed in order to maintain Vufind. This includes setting up and configuring the network, installing and maintaining software, creating user accounts, etc.
- MySQL – Vufind uses MySQL to maintain the state of user sessions, user accounts, and tagging. The systems administrator is expected to create an accessible database for a specific username/password combination, and these values are then denoted in the “[Database]” section of Vufind’s primary configuration file (web/conf/config.ini).
- Apache – Vufind-specific Web server configurations are saved in a file named cportal_site.conf. The configuration file does a number of things: denotes the root of the Vufind filesystem, specifies a number of mod_rewrite rules, and creates a number of aliases for other parts of the Portal (blog, CMS, generic Web space, etc.).
- PHP – PHP is the scripting language used to query the underlying index (Solr) and display the search results. It is necessary to have at least a working knowledge of PHP in order to support the Portal. This is true for two reasons. First, Vufind’s themes (the system’s look & feel) are configured through a combination of PHP code and HTML. Second, the Portal’s specific record drivers (modules used to display search results) are written in PHP. Each of these things is described in greater detail below.
Being essentially a metadata index, the Portal requires an… indexer. Solr is the indexer used by Vufind, and consequently the underlying operating system needs to support Java. Solr does not necessarily need to be installed because it comes pre-packaged in Vufind’s solr directory. On the other hand, since more than one instance of Solr is presently running on cportal.library.nd.edu, the Web server interfacing with Solr (Jetty) needs to be configured to run under a port other than 8080. This configuration has been set in the solr/jetty/jetty.xml file with the “SystemProperty” named “jetty.port”, specifically with the value 8081. This configuration is then reflected in the “[Statistics]” and “[Index]” sections of Vufind’s primary configuration file.
There are three ways Vufind has been customized for the Alliance: 1) configurations, 2) themes, and 3) record drivers. Each of these is described below.
Vufind comes with a whole host of configuration files, and they all live in the web/conf directory, but by far the most important is config.ini. The file is divided into a number of sections, and the most important are:
- Site – This section is used to denote where on the Web Vufind is located, and where it is located on the local file system. The values for path, url, and local need to correspond to similar values in Apache’s configuration file.
- Index – This section denotes the HTTP and filesystem location of the indexer (Solr). The value for url will always be something like http://localhost:8081/solr, but be forewarned. Multiple instances of Solr may be running on the same host as the Portal, and through Solr’s jetty.xml file, each instance of Solr will need to be configured differently as well as be reflected differently in something like the value of url.
- Database – This section denotes the connection information for the MySQL database: its location, the database name, and the username/password combination.
- Content – Edit the values in this section to enhance the content of search results. For example, this section allows search results to have cover images, links to Wikipedia articles, snippet previews, additional reviews, etc. The choices made here really ought to be run by the CRRA’s Digital Access Committee for selection.
Two other configuration files of interest are fulltext.ini (described in the section on Aperture) and sitemap.ini (described in the section on SEO). All of the other configurations in web/conf have not been… configured.
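Because the Solr port has to agree between Jetty and config.ini, a small consistency check can catch a mismatch before it shows up as a failed search. The Python sketch below is an illustration under assumptions: the sample values, the key name under [Statistics], and the function names are hypothetical, and the real config.ini may contain constructs that Python's configparser does not accept.

```python
import configparser
from urllib.parse import urlsplit

# Hypothetical excerpt; only the [Index] url value mirrors the text above.
SAMPLE_CONFIG = """
[Site]
path = /vufind
url = http://example.org/vufind
local = /usr/local/vufind/web

[Index]
url = http://localhost:8081/solr

[Statistics]
solr = http://localhost:8081/solr
"""


def solr_ports(config_text, sections=("Index", "Statistics")):
    """Map 'Section.key' to the port of every HTTP URL in those sections."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(config_text)
    ports = {}
    for section in sections:
        if not parser.has_section(section):
            continue
        for key, value in parser.items(section):
            parts = urlsplit(value)
            if parts.scheme in ("http", "https") and parts.port:
                ports[f"{section}.{key}"] = parts.port
    return ports


def mismatches(config_text, jetty_port):
    """Return the settings whose port differs from the Jetty port."""
    found = solr_ports(config_text)
    return {name: port for name, port in found.items() if port != jetty_port}
```

Calling mismatches(SAMPLE_CONFIG, 8081) returns an empty dictionary when every Solr URL uses the port set in jetty.xml; any entry it does return names a setting that needs attention.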
VuFind record drivers
Record drivers are pieces of PHP code used to read search results returned from Solr and pass them on to themes for display. Maintaining the record drivers is probably the most complicated aspect of Vufind, next to the themes. Each of the record drivers currently in use by the Portal is described below.
- IndexRecord.php – The functions in this driver always get used, unless overridden by another driver later in this list. The only thing enhanced in this driver was the creation of the CRRA member institution (CRRAInstitution), library (CRRALibrary), call number of item (summCallNo), and unique key (CRRAKey). These values have been extracted in the functions named getHoldings and getSearchResult.
- MarcRecord.php – Because the Portal does not get detailed information — like holdings — from an underlying integrated library system, this particular record driver was tweaked in one tiny way. Specifically, in the function named getSearchResult, the value of summAjaxStatus needs to be set to false.
- EadRecord.php – This record driver is specific to the Portal, and it is used to first extract metadata from search results. Once that is done the metadata is passed on to the calling theme for display. Some of the more interesting functions include: getAllSubjectHeadings, getScopeContent, getBiogHist, and getExtendedMetadata.
- PpRecord.php – This is the record driver for PastPerfect data, the XML data coming from the Philadelphia Archdiocesan Historical Research Center. This driver includes functions to extract subject headings (getAllSubjectHeadings), “holdings” (getHoldings), and URLs (getURLs).
The look & feel of VuFind is governed through themes, each a combination of PHP and HTML files. The current look & feel is called “crranew” and it is located in the web/interface/themes directory. As specified in the config.ini file, the crranew theme inherits features from the blueprint theme. Consequently, it is not necessary to write the Portal’s entire user interface.
The most important files to maintain in the crranew theme include:
- header.tpl and footer.tpl – These are the Portal’s… header and footer.
- layout.tpl – This is the content between the header and the footer.
- extended.tpl, holdings.tpl, and result.tpl – These template files, located in the RecordDrivers/Index directory, are used to display search results, individual records, and individual records’ specific characteristics. For the most part, they have been customized from the original blueprint theme to include links to directory information as well as links to external HTML files (transformed EAD files).
Concrete5 is an open source content management system written in PHP, and it requires MySQL. Concrete5 is used to provide textual information about the Alliance: who is involved, what its purpose is, news & information, etc. Based on my experience, everything in Concrete5 is configurable through a Web interface, with the exception of the system’s look & feel. The Alliance’s theme is saved in Concrete5’s themes/crra directory. If a person needs to edit the configurations surrounding MySQL, then the person needs to edit the config/site.php file.
Registered users of Concrete5 can be organized into groups. When people denoted as administrators log into Concrete5 they may be alerted to the existence of new versions of the software. Upgrading Concrete5 is Web-based, but backing up the system’s underlying MySQL database is suggested prior to doing any of this sort of maintenance.
WordPress has been used to blog about the Portal. The root of the blog is configured in Apache’s cportal_site.conf file. WordPress requires MySQL and therefore WordPress requires a database as well as a username/password combination. These values are configured in WordPress’s wp-config.php file. Maintaining WordPress is simply a matter of monitoring when new versions of WordPress become available, installing the new versions, as well as installing new versions of various WordPress plug-ins as they too become available. When upgrading WordPress it is very important to not overwrite the contents of the wp-content directory because this is where attachments, images, and sundry files supplied by bloggers are stored.
The content management system is the place where a membership directory ought to be maintained, but until such a thing is implemented, a partial membership directory has been implemented programmatically. Its purpose is to provide readers of the Portal with the names, addresses, and contact information of institutions that have supplied metadata.
Names, addresses, and contact information have been saved in the CRRA scripts configuration file (etc/libraries.db). When a Perl script (bin/directory.pl) is executed, the result is a rudimentary HTML file with named anchors corresponding to the keys of each member institution with content in the Portal. When readers search the Portal, results are filtered through Vufind’s default record driver (web/RecordDrivers/IndexRecord.php) and the keys of member institutions (CRRAKey) are extracted. These keys are then incorporated into URLs and hyperlinked to the directory in the localized theme (web/interface/themes/crranew/RecordDrivers/Index/result.tpl).
As new members provide content for the Portal, appropriate contact information ought to be added to the configuration file (etc/libraries.db) and the directory ought to be recreated using bin/directory.pl.
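The relationship between the directory's named anchors and the links in search results can be sketched in Python. This is an illustration, not bin/directory.pl: the MEMBERS dictionary stands in for etc/libraries.db, and the member keys, contact addresses, and the /directory.html location are hypothetical.

```python
from html import escape
from urllib.parse import quote

# Hypothetical stand-in for the member entries kept in etc/libraries.db.
MEMBERS = {
    "NDU": {
        "name": "University of Notre Dame",
        "contact": "librarian@example.org",
    },
    "PAH": {
        "name": "Philadelphia Archdiocesan Historical Research Center",
        "contact": "archivist@example.org",
    },
}


def build_directory(members):
    """Render a rudimentary HTML page with one named anchor per member key."""
    lines = ["<html>", "<body>", "<h1>Member directory</h1>"]
    for key in sorted(members):
        member = members[key]
        lines.append(
            f'<h2><a name="{escape(key)}">{escape(member["name"])}</a></h2>'
        )
        lines.append(f'<p>{escape(member["contact"])}</p>')
    lines += ["</body>", "</html>"]
    return "\n".join(lines)


def directory_link(crra_key, directory_url="/directory.html"):
    """Build the URL a search result would use to reach a member's entry."""
    return f"{directory_url}#{quote(crra_key)}"


page = build_directory(MEMBERS)
```

The same key drives both sides: the anchor name in the page and the fragment in the link, which is why a member must be present in the configuration file before its search results can link to contact information.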
When the content management system becomes fully implemented some other way of connecting search results to contact information ought to be implemented.
Filemaker (membership directory, redux)
Filemaker, an Apple Computer database application for Macintosh and Windows, is used to maintain a more complete membership directory. It is made up of a couple of tables: individuals and institutions. They are joined by an institution value, and thus there is a bit of relational database integrity going on. This database is primarily used to keep track of the many people and institutions of the Alliance. For example, people are denoted as liaisons or leaders of organizations. There are many steps in the membership process, and the database helps keep track of them too. There are a couple of scripts built into the database, and they generate lists of names and addresses. These lists were then used in a dynamic membership list in the Portal, but that functionality no longer exists.
It might behoove somebody to create a report against the Filemaker database that generates HTML lists. This HTML could then be pasted into Concrete5 for display purposes.
Logging and statistics
The Portal’s Apache log files are saved and compressed on a daily basis. Rudimentary statistical reports are then regularly generated from these log files. This is accomplished with the aid of the following locally written Perl and shell scripts, and they are all saved in the crra-scripts directory:
- bin/log-load.pl – By default, this script parses yesterday’s Apache log file, and saves the result in a MySQL database. Optionally this script can be given a range of dates on the command line, and it will parse many log files.
- bin/log-parse-queries.pl – This file parses a plain text file of HTTP GET queries. These queries represent the types of searches people have done against the Portal. After the plain text file is parsed the tabulation of queries is sorted and printed to STDOUT.
- bin/log-build.sh – This shell script executes a set of SQL statements against a MySQL database, and the results are sent to STDOUT. All of the SQL statements are located in the etc/sql directory. The consequence is a set of rudimentary statistical reports describing how the Portal is being used.
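The query tabulation can be sketched in Python. This is not the Perl script: it reads raw Apache log lines rather than a prepared text file of queries, it assumes the search term travels in a GET parameter named lookfor, and it lowercases terms before counting. All three choices are assumptions made for illustration, as are the sample log lines.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Matches the request portion of a common/combined format log line.
REQUEST = re.compile(r'"GET (?P<target>\S+) HTTP/[\d.]+"')


def tabulate_queries(lines, parameter="lookfor"):
    """Count how often each search term appears in GET requests."""
    counts = Counter()
    for line in lines:
        match = REQUEST.search(line)
        if not match:
            continue
        query = parse_qs(urlsplit(match.group("target")).query)
        for value in query.get(parameter, []):
            term = value.strip().lower()
            if term:
                counts[term] += 1
    return counts


SAMPLE_LOG = [
    '192.0.2.1 - - [01/Jan/2011:10:00:00 -0500] '
    '"GET /vufind/Search/Results?lookfor=catholic+pamphlets&type=AllFields HTTP/1.1" 200 5120',
    '192.0.2.2 - - [01/Jan/2011:10:05:00 -0500] '
    '"GET /vufind/Search/Results?lookfor=Catholic%20Pamphlets HTTP/1.1" 200 4096',
    '192.0.2.3 - - [01/Jan/2011:10:06:00 -0500] '
    '"GET /vufind/Search/Results?lookfor=dorothy+day HTTP/1.1" 200 2048',
    '192.0.2.3 - - [01/Jan/2011:10:06:01 -0500] "GET /favicon.ico HTTP/1.1" 404 0',
]
counts = tabulate_queries(SAMPLE_LOG)
```

Sorting the result with counts.most_common() gives the kind of ranked list of searches that the reports are meant to surface.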
This whole process has been documented in a series of blog postings: 1) “Statistical reports against the ‘Catholic Portal’”, 2) “Data warehousing Web server log files”, and 3) “Progress with statistics reporting”.
Analog is used to provide rudimentary log file analysis. Its configuration file is saved in analog.cfg, and its output is sent to data/html/admin/statistics/index.html. Analog processes about one month’s worth of data. Analog’s functionality could be improved a bit through the use of a few more configurations, but the software does not seem to currently be maintained, and its output will really describe the health of the Web server, and not necessarily how the Portal is being used.
Full text indexing of digital content is supported through an application framework called Aperture. Implementing this full text indexing is not too difficult, but because our Vufind implementation has not been saved in the standard location, some configuration needs to be done:
- download and save Aperture to the file system
- turn on full text indexing by uncommenting the fulltext definition in Vufind’s marc_local.properties file
- hard-code the value of fulltextIniFile in getFulltext.bsh to point to the location of Vufind’s fulltext.ini file
- uncomment a value for webcrawler in fulltext.ini, and make sure the value points to Aperture’s crawling script (webcrawler.sh)
After these configurations/customizations are complete Vufind will extract URLs from MARC 856 subfield u fields during the indexing process. The URLs will be passed on to Aperture which will temporarily cache the file at the other end of the URL, do its best to extract the text from the downloaded file, return the text for inclusion into the Solr index, and delete the cached file.
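The cache, extract, and delete cycle described above can be sketched in Python. This is an illustration of the flow only, not of Aperture or of Vufind's indexing code: the record is a simplified list of (tag, subfields) pairs, and fake_fetch and fake_extract are hypothetical stand-ins for the network download and for Aperture's text extraction.

```python
import os
import tempfile


def urls_from_856(record):
    """Return every subfield u value found in the record's 856 fields."""
    return [
        value
        for tag, subfields in record
        if tag == "856"
        for code, value in subfields
        if code == "u"
    ]


def full_text(url, fetch, extract):
    """Cache the remote file, extract its text, and always delete the cache."""
    handle, path = tempfile.mkstemp()
    try:
        with os.fdopen(handle, "wb") as cache:
            cache.write(fetch(url))
        return extract(path)
    finally:
        os.remove(path)


# Hypothetical stand-ins for the download and for the text extraction.
cached_paths = []


def fake_fetch(url):
    return f"text of {url}".encode("utf-8")


def fake_extract(path):
    cached_paths.append(path)
    with open(path, "rb") as cached:
        return cached.read().decode("utf-8")


record = [
    ("245", [("a", "A Catholic pamphlet")]),
    ("856", [("u", "http://example.org/pamphlet.pdf"), ("z", "Full text")]),
]
texts = [full_text(url, fake_fetch, fake_extract) for url in urls_from_856(record)]
```

Deleting the cached file in a finally clause means the temporary copy is removed even when extraction fails, which matters when thousands of files pass through during a single indexing run.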
Full text indexing considerably slows down the indexing process. A good example is the metadata from Notre Dame, which includes about a hundred pieces of “Catholic youth literature” and will eventually include thousands of Catholic pamphlets.
The majority of the Alliance’s traffic originates from Google searches. For this reason it behooves the application’s administrator to regularly generate sitemap files for the purposes of SEO (search engine optimization). The process begins by updating a configuration file (web/conf/sitemap.ini) and then running a PHP script (util/sitemap.php). This script ought to be executed by cron. The process is completed by maintaining the Alliance’s Webmasters site at Google. The most important maintenance feature is denoting the URL where local sitemaps are saved.
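For readers unfamiliar with what a sitemap file contains, the Python sketch below builds a minimal one. It is not util/sitemap.php, and the record URLs are hypothetical; only the urlset, url, and loc elements and the namespace come from the sitemap protocol itself.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap document listing each URL in a <url><loc> entry."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical record URLs, one per indexed item.
sitemap = build_sitemap([
    "http://example.org/vufind/Record/NDU000123",
    "http://example.org/vufind/Record/MQU000123",
])
```

The sitemap protocol limits a single file to 50,000 URLs, so a large index is split across several files tied together by a sitemap index file.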