Simple log file analysis

Today I did a bit of simple log file analysis against the Portal’s Apache log file. Specifically, I wanted to extract the queries people have been using.

Naturally, I wrote a program to do this work, parse.pl. It is rather brain-dead and certainly not 100 percent accurate, but it does generate a report of some value.
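For readers who want to try something similar, here is a minimal sketch of the extraction step. It is written in Python and is not parse.pl itself. It assumes the log uses Apache's common or combined format and that the search term arrives in a URL parameter named "query"; that parameter name, like the function name, is hypothetical and would need to match the real Portal URLs.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Matches the quoted request in an Apache common/combined log line,
# for example: "GET /search/?query=Apostles HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')

def extract_query(line, param="query"):
    """Return the decoded search term from one log line, or None.

    `param` is an assumed parameter name, not taken from the Portal.
    """
    match = REQUEST_RE.search(line)
    if match is None:
        return None  # not a request line we recognize
    url = urlsplit(match.group(1))
    # parse_qs decodes "+" and percent-escapes and drops blank values
    values = parse_qs(url.query).get(param)
    if not values:
        return None  # request carried no search term
    return values[0].strip()
```

Like the original program, a sketch of this sort will not be 100 percent accurate: it ignores malformed lines and makes no attempt to separate people from robots.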

In the end, the Portal was queried approximately 18,000 times from September to December 2010. The report itself lists the top 100 queries and the number of times each was searched. The top 5 and the number of searches are:

  1. Meditations (1462)
  2. Cardinal virtues (918)
  3. Newman, John Henry 1801-1890 (349)
  4. Apostles (192)
  5. Theological virtues (184)

The report also lists each query searched only once. Here’s a random sample:

“Christian saints Algeria Hippo (Extinct city) Biography.” * “Christopher Hollis” * “De rege et regis institutione” * “DeAndreis, John A. 1920-1979” * “John Pearson (bishop)” * “John Pearson (cricketer)” * “John R. Cavanaugh” * “John R. Ryan” * “John Richard Parker” * “John Robert * Church year sermons Early works to 1800 * Church year sermons Early works to 1800 Indexes. * Self-esteem * Self-evaluation. * Seminary * Senigallia * Sermons, Chinese * Sermons, English * Sermons, German Early works to 1800 * pontificalia * portavoz * portrait * postmodernity * worldview * wrestling * yellow fever * younger * zill
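The two parts of the report, the most frequent queries and a random sample of queries searched only once, can be sketched along these lines. This is an illustration in Python under my own assumptions, not the logic of parse.pl: the function name and its parameters are hypothetical, and it expects a list of query strings that have already been pulled out of the log.

```python
import random
from collections import Counter

def summarize(queries, top_n=100, sample_size=10, seed=None):
    """Return (total, top, sample) for an iterable of query strings.

    total  -- number of queries seen
    top    -- list of (query, count) pairs, most frequent first
    sample -- random selection of queries that occurred exactly once
    """
    counts = Counter(queries)
    total = sum(counts.values())
    top = counts.most_common(top_n)
    # Queries searched only once, sorted so a fixed seed is repeatable
    singletons = sorted(q for q, n in counts.items() if n == 1)
    rng = random.Random(seed)
    sample = rng.sample(singletons, min(sample_size, len(singletons)))
    return total, top, sample
```

Calling summarize(queries, top_n=5) would give a list shaped like the top 5 above, with each query paired with its count.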

I think the figure of 18,000 queries is high. I will have to investigate that. Based on the queries, I believe most people are browsing the system and not necessarily entering specific queries. Why do I think this? Well, who puts in all of that syntax when searching?