Reproducible Hybrid Time-Travel Retrieval in Evolving Corpora

Moritz Staudinger, Florina Piroi, Andreas Rauber

TL;DR

This work presents a hybrid retrieval system combining Lucene for fast retrieval with a column-store-based retrieval system maintaining a versioned and time-stamped index, which ensures retrieval results in evolving document collections are fully reproducible even when document collections and thus term statistics change.

Abstract

There are settings in which reproducibility of ranked lists is desirable, such as when extracting a subset of an evolving document corpus for downstream research tasks, or in domains with high reproducibility expectations, such as patent retrieval or medical systematic reviews. However, because global term statistics change whenever documents are changed or added to a corpus, queries using typical ranked retrieval models are not reproducible even for the parts of the document corpus that have not changed. Thus, Boolean retrieval frequently remains the mechanism of choice in such settings. We present a hybrid retrieval system combining Lucene for fast retrieval with a column-store-based retrieval system maintaining a versioned and time-stamped index. The latter component allows re-execution of previously posed queries, resulting in the same ranked list, and further allows for time-travel queries over evolving collections, such as web archives, while maintaining the original ranking. Thus, retrieval results in evolving document collections are fully reproducible even when document collections and thus term statistics change.
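The effect described in the abstract can be illustrated with a small toy example. The following is a minimal sketch and not the authors' Lucene/MonetDB implementation: the `DocVersion` and `VersionedIndex` names, the integer timestamps, the whitespace tokenizer, and the simple TF-IDF scoring are all assumptions made for illustration. It shows that adding a document changes the score of an unchanged document, and that keeping validity intervals per document version lets a query be re-executed "as of" an earlier timestamp with an identical ranked list.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DocVersion:
    """One version of a document with a half-open validity interval [added, removed)."""
    doc_id: str
    terms: tuple
    added: int
    removed: float = math.inf  # inf means "still current"


class VersionedIndex:
    """Toy in-memory index that never overwrites old document versions (hypothetical)."""

    def __init__(self):
        self.versions = []

    def add(self, doc_id, text, ts):
        # Close the currently valid version of this document, if one exists.
        for i, v in enumerate(self.versions):
            if v.doc_id == doc_id and v.removed == math.inf:
                self.versions[i] = DocVersion(v.doc_id, v.terms, v.added, ts)
        self.versions.append(DocVersion(doc_id, tuple(text.lower().split()), ts))

    def snapshot(self, ts):
        # All document versions that were valid at time ts.
        return [v for v in self.versions if v.added <= ts < v.removed]

    def search(self, query, ts):
        # Term statistics (N, df) are computed only over the snapshot at ts,
        # so the scores depend on the corpus state at that time, not on today's.
        docs = self.snapshot(ts)
        n = len(docs)
        scores = {}
        for term in query.lower().split():
            df = sum(1 for d in docs if term in d.terms)
            if df == 0:
                continue
            idf = math.log(1 + n / df)
            for d in docs:
                tf = d.terms.count(term)
                if tf:
                    scores[d.doc_id] = scores.get(d.doc_id, 0.0) + tf * idf
        # Sort by descending score, then doc_id, for a deterministic ranking.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


idx = VersionedIndex()
idx.add("d1", "patent retrieval system", ts=1)
idx.add("d2", "medical review", ts=1)
before = idx.search("patent retrieval", ts=1)

# The corpus evolves: a new document changes the global term statistics.
idx.add("d3", "patent claims patent law", ts=2)
after_now = idx.search("patent retrieval", ts=2)

# Time-travel query: re-execute the original query against the state at ts=1.
replay = idx.search("patent retrieval", ts=1)
```

Here `d1` itself never changes, yet its score in `after_now` differs from its score in `before` because the document frequency of "patent" and the corpus size have changed. The replayed query at `ts=1` ignores `d3` and returns the same list as `before`. A production system would store such validity intervals in the index itself rather than rescanning all versions per query.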

Paper Structure

This paper contains 11 sections, 5 figures.

Figures (5)

  • Figure 1: Excerpt of the DB schema for versioned column store based retrieval, primary keys bold
  • Figure 2: Indexing time for Lucene (incl. parsing, stemming, stopword filtering) and overhead for MonetDB index updates, with batch inserts of 20,000 documents
  • Figure 3: Query processing time over growing corpus size
  • Figure 4: Evolution of the average differences between the scores returned by Lucene and VCBR due to differences in floating-point operations
  • Figure 5: Average difference of consecutive scoring results in MonetDB