FTP site as dropbox

In an effort to make it easier for us here at the “Catholic Portal Home Planet”, we have implemented an FTP site designed to be used as a dropbox.

For the longest time, Catholic Research Resources Alliance (CRRA) members sent me their metadata via email. I was then expected to parse it, index it, and make it available for searching. A hidden task in this scenario was archiving the metadata, a task that is not really very scalable. Consequently, I advocated that CRRA members make their metadata available via a website from which I could then harvest the data. Unfortunately, and to my surprise, not every CRRA member was able to do this, mostly because of local infrastructure policies.

To overcome the limitations of some CRRA members, I created an FTP site allowing them to deposit their metadata. This same FTP site is also accessible via the Web, and therefore I can have my cake and eat it too. No CRRA members need to send me their metadata, and I can harvest it from a Web server.
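
As a rough illustration of the harvesting side, here is a minimal Python sketch that pulls metadata file links out of a Web directory listing. Everything specific in it is an assumption: the example.org URL is a placeholder, the sample listing is made up, and the .xml and .mrc extensions are only a guess at what members might deposit. It is not the script actually used for the Portal.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def metadata_links(listing_html, base_url, extensions=(".xml", ".mrc")):
    """Return absolute URLs of listed files whose names look like metadata."""
    parser = LinkCollector()
    parser.feed(listing_html)
    return [
        urljoin(base_url, href)
        for href in parser.hrefs
        if href.lower().endswith(extensions)
    ]


# Hypothetical listing, as a Web server might render one member's directory.
SAMPLE_LISTING = """
<html><body>
<a href="../">Parent Directory</a>
<a href="finding-aid-001.xml">finding-aid-001.xml</a>
<a href="catalog.mrc">catalog.mrc</a>
<a href="README.txt">README.txt</a>
</body></html>
"""

BASE_URL = "http://example.org/dropbox/member/"  # placeholder, not the real site
links = metadata_links(SAMPLE_LISTING, BASE_URL)
for link in links:
    print(link)
```

In practice the listing would be fetched from the Web server first, and each returned URL would then be downloaded and archived before indexing.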

If you are a CRRA member who is unable or not allowed to make your metadata available via the Web, then get in touch with me, Eric Lease Morgan (574/631-8604; emorgan@nd.edu), and I will give you instructions for making your metadata available via the “dropbox”.

Statistical reports against the “Catholic Portal”

This text describes the beginnings of a set of statistical reports describing the use of the “Catholic Portal”.

More specifically, the Portal’s Web server log files are read on a daily basis, normalized, and saved to an underlying database. A number of queries are then applied to the database to create rudimentary lists and tabulations. Each of the reports is described below:

  • Hosts – This report lists the Internet address or name of the top 100 computers using the Portal. To the best of our ability, the list excludes Internet robots and spiders, but the list needs to be updated. As of this writing, it is quite likely that many of the top computers are still robots, and the host named university.archives.nd.edu is probably the most frequent user of the Portal with shunat236-189.shu.edu coming in at a close second.
  • Page count – This is a list of the number of hits the Portal received on any given day. Obviously the script creating this report needs to be updated in order to output data for the current year.
  • Query strings – This is a tabulation of the most frequently used search terms applied against the Portal. The “null” query is probably a simple hit against the “browse” link at the bottom of the Portal’s home page and/or simply clicking the search box’s Find button. The queries in quotes are probably from clicks on hot linked search results.
  • Referrers – This is a list of the websites where people came from before they visited the Portal. A whole lot of these websites are places where blog postings about the Portal appear. Many are spam. Some are HTML versions of the EAD finding aids. Further down the list one can begin to see Google searches.
  • Referrers engines – This report is exactly like the Referrers report except it only includes search engines (Google, Yahoo, and Bing).
  • Tabs – This is a list of the most frequently used links across the top of the Portal’s home page.
  • Top records – This is a tabulation of the most frequently viewed records in the Portal. The first item on the list is an error, but as of this writing the most frequently viewed record is something from Catholic University of America.
  • Types of searches – From this report it is all but obvious that the overwhelming majority of the searches applied against the Portal are free text searches. Nobody uses the advanced search form.
  • Whose records – This is a list of the names of the libraries/institutions whose records are viewed most frequently.
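
To make the read, normalize, save, and query cycle a bit more concrete, here is a minimal Python sketch that produces something like the Hosts report. It assumes the Apache "combined" log format, an in-memory SQLite database, a table named hits, and a very crude robot filter. All of these are assumptions for illustration, as are the sample log lines, and none of it is the Portal's actual code.

```python
import re
import sqlite3

# Assumption: log lines follow the Apache "combined" format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Crude, illustrative robot filter based on the user agent string.
ROBOT_HINTS = ("bot", "spider", "crawl")


def load(lines, db):
    """Parse log lines and save the non-robot hits to the database."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS hits "
        "(host TEXT, time TEXT, path TEXT, referrer TEXT, agent TEXT)"
    )
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m is None:
            continue  # skip lines that do not parse
        if any(hint in m["agent"].lower() for hint in ROBOT_HINTS):
            continue  # skip obvious robots and spiders
        db.execute(
            "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
            (m["host"], m["time"], m["path"], m["referrer"], m["agent"]),
        )


def top_hosts(db, limit=100):
    """Tabulate the most frequent hosts, busiest first."""
    return db.execute(
        "SELECT host, COUNT(*) AS n FROM hits "
        "GROUP BY host ORDER BY n DESC, host LIMIT ?",
        (limit,),
    ).fetchall()


# Made-up sample lines; hosts and paths are placeholders.
SAMPLE_LOG = [
    'host-a.example.org - - [01/Jan/2012:10:00:00 -0500] "GET /Search/Home HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    'host-a.example.org - - [01/Jan/2012:10:01:00 -0500] "GET /Record/1 HTTP/1.1" 200 1024 "http://example.com/" "Mozilla/5.0"',
    'host-b.example.org - - [01/Jan/2012:10:02:00 -0500] "GET /Search/Home HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    'crawler.example.net - - [01/Jan/2012:10:03:00 -0500] "GET /robots.txt HTTP/1.1" 200 64 "-" "ExampleBot/1.0"',
]

db = sqlite3.connect(":memory:")
load(SAMPLE_LOG, db)
for host, count in top_hosts(db):
    print(host, count)
```

The other reports would be variations on the same theme: a different column in the GROUP BY clause (referrer, path, and so on) and perhaps a different filter.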

For a more technical description of how these reports are generated, see the blog posting entitled “Data warehousing Web server log files” as well as a follow-up posting called “Progress with statistics reporting”.

These reports can be improved in any number of ways. First, they could be represented graphically (pie charts, histograms, etc.). Second, they could be re-generated on a month-by-month basis to look for trends over time. Luckily just about all the necessary data has been preserved. Alternatively, a peek at the Portal’s Google Analytics site may illuminate additional trends.

Prioritized list of fixes/enhancements for the “Portal”

Based on our usability studies and conference call from the other day, I have created a (more or less) prioritized list of fixes/enhancements to be applied to the “Portal”:

  • add a note to the email dialog box denoting that the from field is mandatory and requires an email address
  • create a directory of institutions, and from search results hyperlink institutions’ names to the directory
  • update the “Portal” look & feel (theme) so it is based on the “blueprint” theme
  • turn off the “Suggested Topics” feature
  • fix the author searches so when author names are clicked the content displays correctly
  • make the login links float to the right instead of the left
  • change the red text — such as the text in the search box — to black
  • change the login label to read “Login / Create account”

On my mark. Get set. Go.

Indexing PastPerfect metadata for the “Catholic Portal”

Using VuFind’s inherent ability to index OAI metadata, I have successfully been able to index metadata coming from a PastPerfect implementation.

Starting somewhere near version 1.2, VuFind supports the indexing of arbitrary metadata types. Content from OAI repositories was the original example. Later, I figured out how to index EAD files. This was a breakthrough for the “Portal”. Give credit to open source software.


Duplicate records in the “Catholic Portal”

There is some concern about duplicate records in the “Catholic Portal”, and this posting introduces the topic to a wider audience.

The “Catholic Portal” is intended to contain links to and content of a rare and infrequently held nature. Every once in a while search results return duplicate records. For example, yesterday, it was brought to our attention that there are five records with the title Life Of Mrs. Eliza A. Seton. On one hand, few if any of these records are duplicates because the five of them are held by two different institutions, and each institution owns multiple editions. In the sense of a “catalog”, this is perfectly acceptable, if not expected. On the other hand, the Portal is not a catalog but rather an index, and each of the five items is really a variation on a theme. Should these records be merged?
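
One way to at least find such candidates is to group records by a normalized form of their titles. The Python sketch below does only that, and it is an assumption-laden illustration: the record fields (id, title, institution), the institution names, and the normalization rule are all hypothetical, and matching on title alone says nothing about whether two editions ought to be merged.

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Lowercase a title and drop punctuation and extra whitespace."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def candidate_duplicates(records):
    """Group records by normalized title; keep groups with more than one record."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize_title(record["title"])].append(record)
    return {key: group for key, group in groups.items() if len(group) > 1}


# Hypothetical records; ids and institution names are placeholders.
records = [
    {"id": "rec-1", "title": "Life Of Mrs. Eliza A. Seton", "institution": "Institution A"},
    {"id": "rec-2", "title": "Life of Mrs. Eliza A. Seton.", "institution": "Institution A"},
    {"id": "rec-3", "title": "Life of Mrs. Eliza A. Seton", "institution": "Institution B"},
    {"id": "rec-4", "title": "An unrelated title", "institution": "Institution B"},
]

clusters = candidate_duplicates(records)
for title, group in clusters.items():
    holders = sorted({record["institution"] for record in group})
    print(title, len(group), holders)
```

A list like this could feed either approach: merging the records into one, or leaving them separate and simply displaying them together in search results.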
