High Recall, Small Data: The Challenges of Within-System Evaluation in a Live Legal Search System
Gineke Wiggers, Suzan Verberne, Arjen de Vries, Roel van der Burg
TL;DR
The paper addresses how to evaluate ranking changes within a live legal information retrieval system, arguing that standard IR evaluation methods often fail to capture the realities of professional legal search. It examines four common approaches (test collections, implicit feedback, user surveys, and A/B testing) using data from Legal Intelligence and two user studies, and illustrates domain-specific challenges such as the need for high recall, small data volumes, subscription-based access, and jurisdictional constraints. The findings show that test collections are biased toward early precision, implicit feedback is sparse and unreliable, user surveys do not clearly discriminate between rankings, and A/B testing is often not feasible in practice. The authors advocate less common evaluation methods, notably cost-based evaluation models, to better assess ranking changes in within-system scenarios for mid-sized professional IR systems.
Abstract
This paper illustrates some challenges of common ranking evaluation methods for legal information retrieval (IR). We show these challenges with log data from a live legal search system and two user studies. We provide an overview of aspects of legal IR, and the implications of these aspects for the expected challenges of common evaluation methods: test collections based on explicit and implicit feedback, user surveys, and A/B testing. Next, we illustrate the challenges of common evaluation methods using data from a live, commercial, legal search engine. We specifically focus on methods for monitoring the effectiveness of (continuous) changes to document ranking by a single IR system over time. We show how the combination of characteristics in legal IR systems and limited user data can lead to challenges that cause the common evaluation methods discussed to be sub-optimal. In our future work we will therefore focus on less common evaluation methods, such as cost-based evaluation models.
