Evaluating the Explainability of Neural Rankers

Saran Pandian, Debasis Ganguly, Sean MacAvaney

TL;DR

The paper addresses the explainability of neural ranking models (NRMs) by introducing an offline evaluation framework in which each model returns rationales alongside its retrieved documents. It defines an intrinsic metric, Mean Rank Correlation (MRC), and an extrinsic metric, Mean Explanation Relevance (MER), to assess how consistent and how relevant the explanations are across NRMs when the same occlusion-based explanation method is applied to all of them. On the MS-MARCO passage and document collections, the results show that the most effective retrievers are not necessarily the most explainable, that sentence-level rationales tend to be more consistent, and that greater complexity mostly leads to less consistent explanations. Together these findings support explainability as a complementary dimension of IR evaluation. The framework offers a practical, model-agnostic way to quantify explainability and could inspire tools and user studies that better communicate how retrieval decisions are made.

Abstract

Information retrieval models have witnessed a paradigm shift from unsupervised statistical approaches to feature-based supervised approaches to completely data-driven ones that make use of the pre-training of large language models. While the increasing complexity of search models has demonstrably improved effectiveness (measured in terms of the relevance of top-retrieved results), a question worthy of thorough inspection is: "how explainable are these models?". This is what the paper aims to evaluate. In particular, we propose a common evaluation platform to systematically evaluate the explainability of any ranking model, with the explanation algorithm being identical for all the models under evaluation. In our proposed framework, each model, in addition to returning a ranked list of documents, is also required to return a list of explanation units, or rationales, for each document. This meta-information for each document is then used to measure how locally consistent the rationales are, as an intrinsic measure of interpretability that does not require manual relevance assessments. Additionally, as an extrinsic measure, we compute how relevant these rationales are by leveraging sub-document-level relevance assessments. Our findings include a number of interesting observations: sentence-level rationales are more consistent, an increase in complexity mostly leads to less consistent explanations, and interpretability measures offer a complementary dimension of evaluation for IR systems, because consistency is not well correlated with nDCG at top ranks.
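The workflow described in the abstract (one explanation algorithm shared by all rankers, rationales returned per document, then a consistency check that needs no relevance judgments) can be illustrated with a small sketch. This excerpt does not give the definitions of MRC or MER, so the code below is not the paper's metric. It assumes a toy term-overlap scorer (`overlap_score`) in place of a neural ranker, selects rationales by occlusion (removing one unit at a time and measuring the score drop), and uses Kendall's tau between full-document scores and rationale-only scores as a stand-in consistency measure. All function names are hypothetical.

```python
from itertools import combinations
from typing import Callable, List

Scorer = Callable[[str, str], float]


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a ranking model: counts query-term occurrences."""
    terms = set(query.lower().split())
    return float(sum(tok in terms for tok in text.lower().split()))


def occlusion_rationales(query: str, units: List[str], score: Scorer, m: int) -> List[str]:
    """Return the m units whose removal lowers the document score the most."""
    full = score(query, " ".join(units))
    drops = []
    for i in range(len(units)):
        occluded = " ".join(units[:i] + units[i + 1:])
        drops.append((full - score(query, occluded), i))
    # Largest score drop first; ties are broken by original position.
    top = sorted(drops, key=lambda d: (-d[0], d[1]))[:m]
    return [units[i] for _, i in top]


def kendall_tau(xs: List[float], ys: List[float]) -> float:
    """Kendall tau-a between two equal-length score lists."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


def rationale_consistency(query: str, docs: List[List[str]], score: Scorer, m: int) -> float:
    """Assumed consistency proxy: does ranking by rationales alone
    agree with ranking by the full documents?"""
    full_scores = [score(query, " ".join(units)) for units in docs]
    rationale_scores = [
        score(query, " ".join(occlusion_rationales(query, units, score, m)))
        for units in docs
    ]
    return kendall_tau(full_scores, rationale_scores)


# Each document is a list of sentence-level explanation units.
query = "neural ranking explainability"
docs = [
    ["Neural ranking models are effective.", "The weather was pleasant."],
    ["Cooking pasta takes ten minutes.", "Explainability matters."],
    ["Nothing relevant here.", "Still nothing."],
]
consistency = rationale_consistency(query, docs, overlap_score, m=1)
print(consistency)
```

Because the scorer is passed in as a function, the same occlusion procedure can be applied unchanged to any ranking model, which is the property the framework relies on when comparing rankers.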

Paper Structure (18 sections, 5 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: The proposed workflow for measuring the effectiveness of explainable IR models via intrinsic and extrinsic explanation effectiveness measures in addition to relevance-based ones. The meta-information, comprising the explanation units or rationales (shown in blue), is the additional output obtained either from a ranking model itself or with the help of a common explanation methodology (as is the case in our experiments).
  • Figure 2: Effects of varying the size of the explanation units on the intrinsic consistency of the explanations (MRC of Equation \ref{eq:mrc}) on the top-10 search results obtained with several NRMs. A comparison across the plots reveals the effect of varying the number of rationales provided as explanations ($m$). A general observation is that a higher number of explanations coupled with larger explanation units tends to provide more consistent explanations.
  • Figure 3: Effect of the number of rationales on the explanation consistency metrics across different NRMs for the MS-MARCO document ranking task.
  • Figure 4: Per-query comparisons of the relevance and intrinsic explanation consistency measures (MRC) for different IR models on the MS-MARCO passage ranking task.
  • Figure 5: Per-query comparisons of the relevance and intrinsic explanation consistency measures (MRC) for different IR models on the MS-MARCO document ranking task.
  • ...and 1 more figures