Systematic Evaluation of Neural Retrieval Models on the Touché 2020 Argument Retrieval Subset of BEIR

Nandan Thakur, Luiz Bonifacio, Maik Fröbe, Alexander Bondarenko, Ehsan Kamalloo, Martin Potthast, Matthias Hagen, Jimmy Lin

TL;DR

This work systematically investigates why neural retrieval models underperform BM25 on the Touché 2020 argument retrieval subset of BEIR. It conducts a two-stage reproducibility study (a black-box evaluation and a data denoising evaluation), augmented with inference-time document expansion and summarization, post-hoc relevance judgments, and axiomatic analysis. The findings show that neural models favor short, noisy premises, which harms their effectiveness. Denoising and post-hoc judgments improve the neural models' effectiveness by up to 0.52 in nDCG@10, but BM25 remains more effective. The work highlights the need for training-time strategies that incorporate document length normalization and stronger argument-quality signals, and it provides a denoised, post-hoc judged Touché 2020 dataset for future research.

Abstract

The zero-shot effectiveness of neural retrieval models is often evaluated on the BEIR benchmark, a combination of different IR evaluation datasets. Interestingly, previous studies found that particularly on the BEIR subset Touché 2020, an argument retrieval task, neural retrieval models are considerably less effective than BM25. Still, so far, no further investigation has been conducted on what makes argument retrieval so "special". To more deeply analyze the respective potential limits of neural retrieval models, we run a reproducibility study on the Touché 2020 data. In our study, we focus on two experiments: (i) a black-box evaluation (i.e., no model retraining), incorporating a theoretical exploration using retrieval axioms, and (ii) a data denoising evaluation involving post-hoc relevance judgments. Our black-box evaluation reveals an inherent bias of neural models towards retrieving short passages from the Touché 2020 data, and we also find that quite a few of the neural models' results are unjudged in the Touché 2020 data. As many of the short Touché passages are not argumentative and thus non-relevant per se, and as the missing judgments complicate fair comparison, we denoise the Touché 2020 data by excluding very short passages (fewer than 20 words) and by augmenting the unjudged data with post-hoc judgments following the Touché guidelines. On the denoised data, the effectiveness of the neural models improves by up to 0.52 in nDCG@10, but BM25 is still more effective. Our code and the augmented Touché 2020 dataset are available at https://github.com/castorini/touche-error-analysis.
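
The length-based denoising step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary layout of the corpus and relevance judgments, and the use of whitespace splitting to count words are all assumptions made for illustration. Only the 20-word threshold comes from the abstract.

```python
from typing import Dict, Tuple

Corpus = Dict[str, str]            # doc_id -> document text (assumed layout)
Qrels = Dict[str, Dict[str, int]]  # query_id -> {doc_id: relevance label} (assumed layout)


def word_count(text: str) -> int:
    # Assumption: words are counted by whitespace splitting.
    return len(text.split())


def denoise(corpus: Corpus, qrels: Qrels, min_words: int = 20) -> Tuple[Corpus, Qrels]:
    """Drop documents with fewer than `min_words` words and the judgments that point to them."""
    kept = {doc_id: text for doc_id, text in corpus.items()
            if word_count(text) >= min_words}
    filtered_qrels = {
        qid: {doc_id: rel for doc_id, rel in judged.items() if doc_id in kept}
        for qid, judged in qrels.items()
    }
    return kept, filtered_qrels
```

The post-hoc judgment step is manual annotation and is therefore not covered by this sketch.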

Paper Structure (28 sections, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Boxplots showing the average length in words ($x$-axis) of the top-10 Touché 2020 results retrieved by the models on the $y$-axis (sorted by decreasing nDCG@10; oracle: avg. length of all documents judged as relevant). The results of the neural models are much shorter in comparison to BM25.
  • Figure 2: Vanilla zero-shot prompt template used in our work with GPT-3.5 [ouyang:2022] to summarize Touché 2020 documents.
  • Figure 3: Change in effectiveness with DocT5query [nogueira2019doc2query] query expansions and GPT-3.5 [ouyang:2022] summary replacement on Touché 2020. Both techniques improve the nDCG@10 for a majority of the neural models.
  • Figure 4: Document length distribution in Touché 2020 vs. MS MARCO ($x$-axis: document length in words; log-scaled $y$-axis: frequency of document lengths). Touché 2020 has a monotonically decreasing broad distribution, while the MS MARCO distribution is much narrower.
  • Figure 5: Denoising experiment to determine the best threshold $n$ for filtering out short documents in Touché 2020. The effectiveness of all models improves with data denoising as $n$ increases, up to a maximum of 20 words.
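
The per-model statistic plotted in Figure 1 (average length in words of the top-10 results) could be computed along the following lines. This is a hedged sketch rather than the paper's code: the function name, the run and corpus layouts, and whitespace-based word counting are hypothetical choices.

```python
from statistics import mean
from typing import Dict, List


def mean_top_k_length(run: Dict[str, List[str]],
                      corpus: Dict[str, str],
                      k: int = 10) -> Dict[str, float]:
    """Per query, average word count of the k highest-ranked documents.

    `run` maps query_id -> doc_ids sorted by decreasing score (assumed layout);
    `corpus` maps doc_id -> text. Words are counted by whitespace splitting.
    """
    lengths = {}
    for qid, ranked in run.items():
        top = ranked[:k]
        if top:  # skip queries with no retrieved documents
            lengths[qid] = mean(len(corpus[doc_id].split()) for doc_id in top)
    return lengths
```

Collecting these per-query values for each model gives the kind of distribution a boxplot such as Figure 1 summarizes.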