SSE: A Metric for Evaluating Search System Explainability

Catherine Chen, Carsten Eickhoff

TL;DR

This work addresses the lack of standardized evaluation for explainable information retrieval by introducing Search System Explainability (SSE), a continuous, multidimensional metric derived from a two-factor model of utility and roadblocks. SSE aggregates responses from a 19-item questionnaire using loading coefficients and applies MinMax normalization to produce a bounded score, enabling quantitative comparison of explainability across systems. A crowdsourced study contrasts a baseline non-explainable system with an explainable system that visualizes term importance. It demonstrates that SSE can reliably distinguish explainable from non-explainable systems (mean SSE 0.67 vs 0.44, p < 0.001) and finds comparable perceived temporal demand and performance between native and non-native English speakers. The findings provide a practical blueprint for evaluating explainability in XIR that may generalize to other ML/NLP domains, with implications for system design and user trust.
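The aggregation described above can be illustrated with a minimal sketch. It assumes the raw score is a loading-weighted sum of Likert responses and that MinMax normalization uses the lowest and highest raw scores achievable on the scale. The 7-point scale, the loading values, the sign convention for roadblock items, and the names `sse_score` and `LOADINGS` are hypothetical placeholders, not values from the paper.

```python
from typing import Sequence

# Hypothetical loadings for 19 items: positive for "utility" items and
# negative for "roadblock" items. Both the values and the 11/8 split are
# illustrative assumptions, not the coefficients reported in the paper.
LOADINGS = [0.7] * 11 + [-0.5] * 8


def sse_score(
    responses: Sequence[float],
    loadings: Sequence[float] = LOADINGS,
    scale_min: float = 1,  # assumed lower end of the Likert scale
    scale_max: float = 7,  # assumed upper end of the Likert scale
) -> float:
    """Return a loading-weighted questionnaire score rescaled to [0, 1]."""
    if len(responses) != len(loadings):
        raise ValueError("need exactly one response per questionnaire item")

    # Raw score: each response weighted by its item loading.
    raw = sum(w * r for w, r in zip(loadings, responses))

    # Lowest raw score: minimum response on positively loaded items and
    # maximum response on negatively loaded items. Highest is the reverse.
    lowest = sum(w * (scale_min if w >= 0 else scale_max) for w in loadings)
    highest = sum(w * (scale_max if w >= 0 else scale_min) for w in loadings)
    if highest == lowest:
        raise ValueError("loadings must not all be zero")

    # MinMax normalization onto [0, 1].
    return (raw - lowest) / (highest - lowest)


if __name__ == "__main__":
    neutral = [4] * 19  # midpoint answer on every item
    print(round(sse_score(neutral), 3))
```

Under these assumptions, a participant who answers at the scale midpoint on every item scores 0.5, and scores move toward 1 as utility items are rated higher and roadblock items lower.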

Abstract

Explainable Information Retrieval (XIR) is a growing research area focused on enhancing the transparency and trustworthiness of the complex decision-making processes taking place in modern information retrieval systems. While there has been progress in developing XIR systems, empirical evaluation tools to assess the degree of explainability attained by such systems are lacking. To close this gap and gain insights into the true merit of XIR systems, we extend existing insights from a factor analysis of search explainability to introduce SSE (Search System Explainability), an evaluation metric for XIR search systems. Through a crowdsourced user study, we demonstrate SSE's ability to distinguish between explainable and non-explainable systems, showing that higher scores indeed indicate greater interpretability. Additionally, we observe comparable perceived temporal demand and performance levels between non-native and native English speakers. We hope that, aside from these concrete contributions to XIR, this line of work will serve as a blueprint for similar explainability evaluation efforts in other domains of machine learning and natural language processing.

Paper Structure (11 sections, 1 equation, 4 figures, 2 tables)

Figures (4)

  • Figure 1: BARS search system based on the system presented by ramos2020search. Stacked bar graphs, located alongside each search result, depict the BM25 scores assigned to individual keywords in the search query. Hovering over a bar reveals the corresponding score for the respective term.
  • Figure 2: Loadings for individual questionnaire items. Items are labeled by their original aspect as determined by chen2022evaluating and ordered by score (response multiplied by item loading) difference between the two systems. Aspects with bars increasing toward the right indicate the practical usefulness of each system, while aspects with bars increasing toward the left represent areas requiring improvement to achieve full explainability.
  • Figure 3: Distribution of SSE scores. Results from a Wilcoxon signed-rank test (T = 98.0, p < 0.001) indicate a statistically significant difference in scores between systems.
  • Figure 4: NASA-RTLX responses of native (n=51) and non-native (n=49) English speakers. While non-native English speakers report a higher perceived task load than native English speakers on 3 dimensions, they report lower perceived Temporal Demand and Effort along with higher perceived Performance.
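
The T statistic reported in the Figure 3 caption comes from a Wilcoxon signed-rank test on paired scores. The following is a rough sketch of how such a statistic is conventionally computed: drop zero differences, rank the absolute differences with average ranks for ties, and take the smaller of the positive and negative rank sums. The function name `wilcoxon_t` and the sample data are hypothetical, and the sketch does not compute a p-value or reproduce the study's data.

```python
from typing import List, Sequence


def wilcoxon_t(x: Sequence[float], y: Sequence[float]) -> float:
    """Return the Wilcoxon signed-rank statistic T for paired samples."""
    if len(x) != len(y):
        raise ValueError("samples must be paired and of equal length")

    # Paired differences; pairs with zero difference are discarded.
    diffs: List[float] = [a - b for a, b in zip(x, y) if a != b]

    # Indices of the differences, ordered by absolute size.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))

    # Assign 1-based ranks; tied absolute differences share the average rank.
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    # Sum the ranks separately for positive and negative differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus)


if __name__ == "__main__":
    # Hypothetical paired ratings from four participants (not study data).
    explainable = [7, 6, 8, 5]
    baseline = [4, 5, 3, 6]
    print(wilcoxon_t(explainable, baseline))
```

A small T relative to the number of pairs means that differences in one direction dominate, which is what a significant result between the two systems reflects.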