Data warehousing Web server log files

I have begun to create a data warehouse for CRRA (VuFind) Web server log files. This posting introduces the topic.

The problem

There is an understandable need/desire to know how well the “Catholic Portal” is operating. But for the life of me I was not able to enumerate metrics defining success. On the other hand, Pat Lawton had no problem listing quite a few. Here are most of her suggestions:

  • Are users looking at records?
  • Are users searching in English? Other languages?
  • Are users using field searches?
  • Can we get a sense of the number of records viewed per search?
  • Do we know how many searches resulted in zero hits?
  • How many hits came from a google search result? Or other search engine?
  • How many hits per day?
  • How many times were each institution’s records viewed?
  • How many times were the Web 2.0 things used?
  • How many users set up an account?
  • How often were the tabs at the top clicked on?
  • Per searches where records were looked at?
  • What is the average number of hits retrieved per search?
  • What percentage of queries resulted in an error message?
  • What sorts of search strings are entered?
  • When are the peak periods of use? Is there a pattern?
  • Where are users coming from?
  • Which geographic locations and types of institutions?
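A couple of these questions (hits per day, hits arriving from Google) can be answered once the log lines sit in a database table. What follows is only a minimal sketch, not the warehouse actually being built: it assumes the server writes Apache's "combined" log format, and the table name `hits`, its columns, and the function names are all hypothetical.

```python
import re
import sqlite3
from datetime import datetime

# Assumption: Apache "combined" log format; the real server may be configured differently.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Turn one log line into a row tuple, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return (
        match.group("host"),
        when.date().isoformat(),
        match.group("path"),
        int(match.group("status")),
        match.group("referrer"),
    )


def load(lines, connection):
    """Insert every parsable line into the (hypothetical) hits table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS hits "
        "(host TEXT, day TEXT, path TEXT, status INTEGER, referrer TEXT)"
    )
    rows = [row for row in map(parse_line, lines) if row is not None]
    connection.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?)", rows)
    connection.commit()
    return len(rows)


def hits_per_day(connection):
    """How many hits per day?"""
    return connection.execute(
        "SELECT day, COUNT(*) FROM hits GROUP BY day ORDER BY day"
    ).fetchall()


def hits_from_google(connection):
    """How many hits carry a Google referrer?"""
    return connection.execute(
        "SELECT COUNT(*) FROM hits WHERE referrer LIKE '%google.%'"
    ).fetchone()[0]
```

Calling `load(open("access.log"), sqlite3.connect("logs.db"))` would fill the table, after which each remaining question becomes one more SQL query. The file names here are placeholders.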


CRRA in San Diego January 6, 2011

From left to right: Eric Morgan (ND), Eric Frierson (St. Ed’s), Marta Deyrup (Seton Hall), Clay Stalls (Loyola Marymount), Kris Brancolini (Loyola Marymount), Jennifer Younger (CRRA), Tyrone Cannon (Univ of San Francisco), Janice Welburn (Marquette), Jean Zanoni (Marquette), Pat Lawton (CRRA), Alma Ortega (Univ of San Diego), Theresa Byrd (Univ of San Diego), Susan Ohmer (Notre Dame), Laverna Saunders (Duquesne), Diane Maher (U San Diego), Ed Starkey (U San Diego)

The San Diego meeting provided an opportunity for new and continuing CRRA members and friends to look at the enhanced portal, discuss future directions for the CRRA, and, last but not least, get to know one another.

CRRA in San Diego Jan. 6, 2011

We look forward to seeing many of you in San Diego for our upcoming meeting.  Full details follow and are on the web at

Portal development is a focal point for this meeting.  Many milestones have been met and Eric will demonstrate new portal functionality including Web 2.0 features of VuFind, an EAD indexing and display tool, and text mining techniques to facilitate discovery and creation of new knowledge.

For those of you unable to join us on-site, please join via the live webcast.  You may virtually join the meeting at any time, simply by clicking on this link:


CRRA in San Diego

This is a simple annotated list of links used as an outline for a presentation to the CRRA in San Diego:

  1. CRRA website – The good ol’ look & feel but wrapped around new content and functionality. (“Thank you, Eric Frierson!”)
  2. Web 2.0 – All the Web 2.0 links (cite this, email this, favorite this) that did not work previously now function correctly.
  3. EAD viewer – It is now possible to view EAD files locally or from the originating institution.
  4. Item-level indexing – The content of EAD files is indexed at the item level making for finer-grained searching.
  5. PDF display – Records linking to digitized versions of books now enable a person to get the full text. Examples include content from St. Michael’s and the University of Notre Dame.
  6. Text mining – After extracting the full text from the PDF documents, it is possible to apply concordancing techniques to the full text for analysis.
  7. Automated updating – The “Portal” can be updated automatically by harvesting metadata from member institutions, massaging it for the Portal, and re-indexing it on a regular basis.
  8. Use statistics – Rudimentary Web server log file analysis as well as Google Analytics reports illustrate how the Portal is being used.
  9. Blog – A running commentary on what’s happening with Portal development.

Simple log file analysis

Today I did a bit of simple log file analysis against the Portal’s Apache log file. Specifically, I wanted to extract the queries people have been using.

Naturally, I wrote a program to do this work. It is rather brain-dead and certainly not 100 percent accurate, but it does generate a report of some value.

In the end, the Portal was queried approximately 18,000 times from September to December 2010. The report itself lists the top 100 queries and the number of times each was searched. The top 5 and the number of searches are:

  1. Meditations (1462)
  2. Cardinal virtues (918)
  3. Newman, John Henry 1801-1890 (349)
  4. Apostles (192)
  5. Theological virtues (184)

The report also lists each query searched only once. Here’s a random sample:

“Christian saints Algeria Hippo (Extinct city) Biography.” * “Christopher Hollis” * “De rege et regis institutione” * “DeAndreis, John A. 1920-1979” * “John Pearson (bishop)” * “John Pearson (cricketer)” * “John R. Cavanaugh” * “John R. Ryan” * “John Richard Parker” * “John Robert * Church year sermons Early works to 1800 * Church year sermons Early works to 1800 Indexes. * Self-esteem * Self-evaluation. * Seminary * Senigallia * Sermons, Chinese * Sermons, English * Sermons, German Early works to 1800 * pontificalia * portavoz * portrait * postmodernity * worldview * wrestling * yellow fever * younger * zill

I think the value of 18,000 queries is high. I will have to investigate that. Based on the queries, I believe most people are browsing the system and not necessarily entering specific queries. Why do I think this? Well, who puts in all of that syntax when searching?
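For the curious, a program along these lines could be sketched as follows. This is not the program I actually ran. It is written in Python as an illustration, it assumes queries arrive in a URL parameter named `lookfor` (VuFind's usual search parameter, but an assumption about this installation), and the function names are made up.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Grab the requested URL from the quoted request field of an Apache log line.
REQUEST_PATTERN = re.compile(r'"(?:GET|POST) (\S+)')


def extract_queries(lines, parameter="lookfor"):
    """Yield each search string found in the log lines.

    The parameter name is an assumption; change it to match the site.
    """
    for line in lines:
        match = REQUEST_PATTERN.search(line)
        if match is None:
            continue
        # parse_qs decodes "+" and %XX escapes and skips blank values.
        values = parse_qs(urlsplit(match.group(1)).query).get(parameter, [])
        for value in values:
            query = value.strip()
            if query:
                yield query


def report(lines, top=100):
    """Return the most frequent queries and the queries searched only once."""
    counts = Counter(extract_queries(lines))
    singles = sorted(query for query, n in counts.items() if n == 1)
    return counts.most_common(top), singles
```

Something like `report(open("access_log"))` would return the two lists (the file name is a placeholder). The sketch counts queries exactly as typed, so "Meditations" and "meditations" are tallied separately, and it does nothing to filter out robots, which may be one reason a total like 18,000 looks high.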