ASPIRE: Assistive System for Performance Evaluation in IR
Georgios Peikos, Wojciech Kusa, Symeon Symeonidis
TL;DR
ASPIRE is a web-based visual analytics platform for in-depth, multi-faceted analysis of IR experiments that goes beyond reporting traditional metrics. Built with Python and Streamlit, it integrates standard IR tools and statistical methods to support single- and multi-run comparisons, query-level analysis, analysis of query characteristics, and collection-level retrieval insights, demonstrated on the TREC Clinical Trials corpus. Its modular architecture, input validation, and exportable outputs support reproducibility and make it easy for researchers, organizers, and practitioners to adopt, whether online or locally. By linking retrieval results with publication data, ASPIRE promotes transparency and deeper engagement with experimental evidence; ongoing work aims to extend its capabilities and adoption in the IR community.
Abstract
Information Retrieval (IR) evaluation involves far more than presenting performance measures in a table. Researchers often need to compare multiple models across several dimensions, such as the precision-recall trade-off and response time, and to understand why specific queries perform differently across models. We introduce ASPIRE (Assistive System for Performance Evaluation in IR), a visual analytics tool designed to address these complexities by providing an extensive and user-friendly interface for in-depth analysis of IR experiments. ASPIRE supports four key aspects of IR experiment evaluation and analysis: single- and multi-experiment comparisons, query-level analysis, the interplay between query characteristics and performance, and collection-based retrieval analysis. We showcase the functionality of ASPIRE using the TREC Clinical Trials collection. ASPIRE is an open-source toolkit available online at https://github.com/GiorgosPeikos/ASPIRE
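
To make the idea of query-level analysis concrete, the following is a minimal, self-contained sketch of how two runs might be compared query by query using average precision. It is not taken from ASPIRE's code base: the function names, the dictionary-based run and qrels representation, and the toy data are all assumptions made for illustration, and a real analysis would read TREC-formatted files and apply a significance test on top of the per-query differences.

```python
from statistics import mean


def average_precision(ranking, relevant):
    """Average precision of one ranked list of doc ids, given the set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant)


def per_query_ap(run, qrels):
    """Map each judged query id to the AP of its ranking (empty ranking if missing)."""
    return {qid: average_precision(run.get(qid, []), rel) for qid, rel in qrels.items()}


# Hypothetical toy data: relevance judgments and two runs over two queries.
qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
run_a = {"q1": ["d1", "d2", "d3"], "q2": ["d1", "d2"]}
run_b = {"q1": ["d2", "d1", "d3"], "q2": ["d2", "d1"]}

ap_a = per_query_ap(run_a, qrels)
ap_b = per_query_ap(run_b, qrels)

# Per-query differences show where each run wins, which an averaged score hides.
deltas = {qid: ap_b[qid] - ap_a[qid] for qid in qrels}

print(f"MAP run_a: {mean(ap_a.values()):.4f}")
print(f"MAP run_b: {mean(ap_b.values()):.4f}")
for qid, delta in sorted(deltas.items()):
    print(f"{qid}: AP difference (b - a) = {delta:+.4f}")
```

In this toy case run_b has the higher mean average precision, yet it is worse than run_a on q1. This is the kind of per-query behaviour that a single averaged score in a results table does not reveal.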
