There is some concern about duplicate records in the “Catholic Portal”, and this posting introduces the topic to a wider audience.
The “Catholic Portal” is intended to contain links to and content from materials that are rare and infrequently held. Every once in a while, search results return duplicate records. For example, it was brought to our attention yesterday that there are five records with the title Life Of Mrs. Eliza A. Seton. On one hand, few if any of these records are true duplicates: the five are held by two different institutions, and each institution owns multiple editions. In the sense of a “catalog”, this is perfectly acceptable, if not expected. On the other hand, the Portal is not a catalog but rather an index, and each of the five items is really a variation on a theme. Should these records be merged?
Demian Katz shared with me and the Portal’s Digital Access Committee a query that can be applied to the Portal’s underlying Solr index. It is shown here, with carriage returns added for readability:
http://localhost:8080/solr/biblio/select/?
q=*%3A*&rows=0&start=0&facet=true&facet.mincount=2&
facet.limit=-1&facet.field=oclc_num&facet.field=isbn
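For anyone who wants to run the query from a script, the same request can be assembled programmatically. The following is only a sketch in Python: the host, core, and field names are copied from the query above, while the function name duplicate_key_query and its arguments are made up for illustration.

```python
from urllib.parse import urlencode

# Base URL copied from the query above; adjust host and core as needed.
SOLR_SELECT = "http://localhost:8080/solr/biblio/select/"


def duplicate_key_query(fields=("oclc_num", "isbn"), min_count=2):
    """Build a facet-only Solr URL listing key values seen at least min_count times.

    The function name and arguments are hypothetical, chosen for this example.
    """
    params = [
        ("q", "*:*"),                    # match every record in the index
        ("rows", 0),                     # return no documents, only facet counts
        ("start", 0),
        ("facet", "true"),
        ("facet.mincount", min_count),   # hide values below this count
        ("facet.limit", -1),             # no cap on the number of facet values
    ]
    # Solr accepts one facet.field parameter per field to be faceted.
    params += [("facet.field", name) for name in fields]
    return SOLR_SELECT + "?" + urlencode(params)


print(duplicate_key_query())
```

The printed URL percent-encodes the asterisks as well as the colon, which Solr decodes to the same q=*:* query.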
The result of this query is a list of OCLC numbers and ISBNs that occur in the index at least two times. According to the result, which only matches on the OCLC or ISBN keys, no record appears in the index more than three times. Furthermore, there are about 1,100 duplicated OCLC numbers and about 300 duplicated ISBNs. Measured against the total number of records in the index (93,000), these 1,400 or so duplicated keys represent a duplication rate of approximately 1.5%. Is this value too high?
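The arithmetic behind that percentage can be sketched in a few lines of Python. This assumes the query is run with wt=json added (which is not part of the URL above), so that Solr returns each facet field in its default flat list of alternating values and counts. The sample response below is invented purely to show the shape of the data, and the function names are hypothetical.

```python
def duplicated_values(facet_list):
    """Convert Solr's flat [value, count, value, count, ...] list into a dict."""
    return dict(zip(facet_list[0::2], facet_list[1::2]))


def duplication_summary(response, total_records):
    """Count duplicated keys per facet field and divide by the index size.

    This counts duplicated key values, not surplus records, and a record
    carrying both a duplicated OCLC number and a duplicated ISBN is counted
    twice, so the rate is a rough estimate.
    """
    fields = response["facet_counts"]["facet_fields"]
    per_field = {name: len(duplicated_values(values)) for name, values in fields.items()}
    return per_field, sum(per_field.values()) / total_records


# Invented sample response, only to illustrate the structure.
sample = {
    "facet_counts": {
        "facet_fields": {
            "oclc_num": ["0000001", 2, "0000002", 3],
            "isbn": ["0000000000", 2],
        }
    }
}
per_field, rate = duplication_summary(sample, total_records=200)
print(per_field, f"{rate:.1%}")

# The approximate figures reported in this posting.
post_rate = (1100 + 300) / 93000
print(f"{post_rate:.1%}")  # 1.5%
```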
In an ideal world, there would be no duplicate records, or duplicates would be merged into a single record. Unfortunately, the definition of “duplicate” is ambiguous, and a process for eliminating duplicates has not been implemented. To a Walt Whitman scholar, the differences between the various editions of Leaves of Grass are definitely significant. For other people and at other times, those differences matter much less. In principle, even a single duplicate record would warrant a de-duplication process, but the expense of de-duplicating that single record may be very high, especially if there is no de-duplication process in place. How many records, or what percentage of records, warrant a de-duplication process, especially considering the other things that have been set as priorities for the Portal? Honestly, I don’t know the answer.