A Semantic Search Pipeline for Causality-driven Adhoc Information Retrieval
Dhairya Dalal, Sharmi Dev Gupta, Bentolhoda Binaei
TL;DR
CAIR-2021 asks systems to retrieve documents that describe the likely causes of a query event, not merely documents that are topically relevant. The authors propose an unsupervised semantic search pipeline that combines a lexical BM25 index with a semantic embedding index and uses an aggregator to merge the results of multiple query strategies. Three query modes are combined: Q1 (semantic embeddings), Q2 (lexical BM25), and Q3 (causal keywords from the narrative). A post-query causal filter is explored but not adopted. On the CAIR-2021 test set, the approach achieves state-of-the-art performance, with substantial improvements in MAP and P@5 over traditional and pure-semantic baselines.
Abstract
We present an unsupervised semantic search pipeline for the Causality-driven Adhoc Information Retrieval (CAIR-2021) shared task. The CAIR shared task expands traditional information retrieval to support the retrieval of documents containing the likely causes of a query event. A successful system must be able to distinguish between documents that are merely topical and documents that describe events causally related to the query event. Our approach aggregates results from multiple query strategies over a semantic index and a lexical index. The proposed approach leads the CAIR-2021 leaderboard and outperforms both traditional IR and pure semantic embedding-based approaches.
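The aggregation step can be illustrated with a minimal sketch. The text above does not say how results from the different query strategies are merged, so the example below assumes reciprocal rank fusion (RRF), a common rank-merging technique, purely for illustration; it should not be read as the authors' aggregator. The function name, the constant `k=60`, and all document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_n=None):
    """Merge several ranked lists of document IDs into one ranking.

    Assumption: RRF stands in for the aggregator, which is not specified
    in the text. Each document scores sum(1 / (k + rank)) over the lists
    in which it appears, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by document ID.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused if top_n is None else fused[:top_n]


# Hypothetical top-3 results from the three query strategies.
q1_semantic = ["d3", "d1", "d7"]         # Q1: semantic embedding index
q2_bm25 = ["d1", "d4", "d3"]             # Q2: lexical BM25 index
q3_causal_keywords = ["d1", "d3", "d9"]  # Q3: causal keywords from narrative

fused = reciprocal_rank_fusion([q1_semantic, q2_bm25, q3_causal_keywords])
print(fused)
```

Under this scheme, a document that ranks well under several strategies (here `d1` and `d3`) rises above documents returned by only one of them, which is the general intuition behind merging lexical and semantic result lists.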
