A Semantic Search Pipeline for Causality-driven Adhoc Information Retrieval

Dhairya Dalal, Sharmi Dev Gupta, Bentolhoda Binaei

TL;DR

The CAIR-2021 shared task asks systems to retrieve documents that describe the likely causes of a query event, rather than documents that are merely topically relevant. The authors propose an unsupervised semantic search pipeline that combines a lexical BM25 index with a semantic embedding index and uses an aggregator to merge the results of multiple query strategies. Three query strategies are combined: Q1 (semantic embeddings), Q2 (lexical BM25), and Q3 (causal keywords extracted from the narrative). A post-query causal filter is explored but not adopted. On the CAIR-2021 test set, the approach leads the leaderboard, with substantial improvements in MAP and P@5 over traditional IR and pure semantic embedding baselines.

Abstract

We present an unsupervised semantic search pipeline for the Causality-driven Adhoc Information Retrieval (CAIR-2021) shared task. The CAIR shared task expands traditional information retrieval to support the retrieval of documents containing the likely causes of a query event. A successful system must be able to distinguish between documents that are merely topical and documents that describe events causally related to the query event. Our approach aggregates results from multiple query strategies over a semantic index and a lexical index. The proposed approach leads the CAIR-2021 leaderboard and outperforms both traditional IR and pure semantic embedding-based approaches.

Paper Structure

This paper contains 12 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Example CAIR topic. Each topic consists of a query (title), which describes an event, and a narrative, which contains descriptions of documents that are causally relevant to the event.
  • Figure 2: Siamese sentence embedding architecture for asymmetric matching.
  • Figure 3: The semantic search pipeline aggregates results from three query strategies, $Q1$, $Q2$, and $Q3$. $Q1$ embeds the query using the sentence embedding model and retrieves the most relevant results based on cosine similarity. $Q2$ and $Q3$ retrieve the most relevant documents from the lexical index. $Q3$ adds filtering and keyword extraction steps to transform the narrative description into causal search terms. Finally, results from all three queries ($Q1'$, $Q2'$, and $Q3'$) are aggregated and re-ranked by the aggregator module. The top 500 relevant submissions are returned.
  • Figure 4: Example results returned by the semantic search pipeline and the narrative-only Okapi BM25 baseline. The baseline returns a topically relevant result based on keyword matches but fails to describe why Shashi Tharoor resigned.
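
The aggregation step described in the Figure 3 caption can be illustrated with a small sketch. The text above does not say how the aggregator scores or re-ranks documents, so the example below substitutes reciprocal rank fusion (RRF), a common rank-merging technique, purely as a stand-in. The function name `reciprocal_rank_fusion`, the smoothing constant `k = 60`, and the document ids are all assumptions made for illustration; only the cutoff of 500 comes from the caption.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    ranked_lists: Sequence[Sequence[str]],
    k: int = 60,       # assumed smoothing constant, not taken from the paper
    top_n: int = 500,  # the caption states that the top 500 results are returned
) -> List[str]:
    """Merge several ranked lists of document ids into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in, so
    documents retrieved by several query strategies rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


# Hypothetical ranked outputs of the three query strategies (Q1', Q2', Q3').
q1_results = ["d3", "d1", "d7"]  # semantic index, cosine similarity
q2_results = ["d1", "d2", "d3"]  # lexical index, query text
q3_results = ["d1", "d9"]        # lexical index, causal keywords from narrative

fused = reciprocal_rank_fusion([q1_results, q2_results, q3_results])
print(fused)
```

In this toy input, `d1` is returned by all three strategies and `d3` by two, so they lead the fused ranking. Any other merging rule (for example, normalized score averaging) could be swapped in without changing the surrounding pipeline structure.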