Evaluating the Explainability of Neural Rankers
Saran Pandian, Debasis Ganguly, Sean MacAvaney
TL;DR
The paper addresses the explainability of neural ranking models (NRMs) by introducing an offline evaluation framework in which each model returns rationales alongside its retrieved documents. It defines an intrinsic metric, Mean Rank Correlation ($MRC$), and an extrinsic metric, Mean Explanation Relevance ($MER$), to assess how faithful and relevant explanations are across NRMs when a common occlusion-based explanation method is applied to all of them. On the MS-MARCO passage and document collections, the results show that the most effective retrievers are not necessarily the most explainable, that sentence-level rationales tend to be more consistent, and that increasing explanation complexity can reduce interpretability. Together, these findings highlight the value of explainability as a complementary dimension of IR evaluation. The framework offers a practical, model-agnostic way to quantify explainability and could inspire tools and user studies that better communicate how retrieval decisions are made.
Abstract
Information retrieval models have witnessed a paradigm shift from unsupervised statistical approaches to feature-based supervised approaches to completely data-driven ones that make use of the pre-training of large language models. While the increasing complexity of search models has led to demonstrable improvements in effectiveness (measured in terms of the relevance of top-retrieved results), a question worthy of thorough inspection is "how explainable are these models?", which is what this paper aims to evaluate. In particular, we propose a common evaluation platform to systematically evaluate the explainability of any ranking model (the explanation algorithm being identical for all the models that are to be evaluated). In our proposed framework, each model, in addition to returning a ranked list of documents, is also required to return a list of explanation units, or rationales, for each document. This meta-information for each document is then used to measure how locally consistent these rationales are, which serves as an intrinsic measure of interpretability that does not require manual relevance assessments. Additionally, as an extrinsic measure, we compute how relevant these rationales are by leveraging sub-document level relevance assessments. Our findings include a number of interesting observations: sentence-level rationales are more consistent, an increase in complexity mostly leads to less consistent explanations, and interpretability measures offer a complementary dimension of evaluation of IR systems because consistency is not well-correlated with nDCG at top ranks.
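
To make the idea of a model returning rationales alongside each document concrete, the following is a minimal sketch of occlusion-based rationale selection at the sentence level. It is an illustration under stated assumptions, not the paper's implementation: `overlap_score` is a hypothetical toy scorer standing in for a ranking model, `occlusion_rationales` and its parameter `k` are names introduced here, and the sketch does not compute $MRC$ or $MER$, whose exact definitions are not given in this summary.

```python
from typing import Callable, List, Tuple


def overlap_score(query: str, text: str) -> float:
    """Hypothetical toy scorer standing in for a ranking model.

    Counts whitespace-separated tokens of `text` that appear in the query.
    Punctuation is not stripped, so "documents." does not match "documents".
    """
    query_terms = set(query.lower().split())
    return float(sum(1 for tok in text.lower().split() if tok in query_terms))


def occlusion_rationales(
    query: str,
    sentences: List[str],
    score_fn: Callable[[str, str], float],
    k: int = 1,
) -> List[Tuple[str, float]]:
    """Return the k sentences whose removal lowers the document score most.

    Each sentence is occluded (removed) in turn; the drop in score relative
    to the full document is taken as that sentence's importance.
    """
    full_score = score_fn(query, " ".join(sentences))
    drops = []
    for i in range(len(sentences)):
        occluded = " ".join(sentences[:i] + sentences[i + 1:])
        drops.append((full_score - score_fn(query, occluded), i))
    # Largest drop first; ties are broken by original sentence position.
    drops.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(sentences[i], drop) for drop, i in drops[:k]]


if __name__ == "__main__":
    doc = [
        "Neural ranking models score documents.",
        "The weather is nice.",
        "Ranking is hard.",
    ]
    for sentence, drop in occlusion_rationales("neural ranking", doc, overlap_score, k=2):
        print(f"{drop:.1f}  {sentence}")
```

Because the same `score_fn` interface can wrap any ranker, the same occlusion procedure could be applied unchanged to every model under evaluation, which mirrors the requirement that the explanation algorithm be identical across models.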
