Evaluating Search System Explainability with Psychometrics and Crowdsourcing

Catherine Chen, Carsten Eickhoff

TL;DR

This paper tackles the lack of a standardized, multidimensional notion of explainability in information retrieval by proposing SSE, a continuous metric built on psychometrics and crowdsourcing. It develops a two-factor model of explainability (utility and roadblocks) via exploratory and confirmatory factor analysis, and then formalizes SSE to quantify explainability from factor loadings and item responses. A large-scale crowdsourced study demonstrates that SSE distinguishes between an explainable (BARS) and a non-explainable (BASELINE) search interface, with BARS achieving higher scores and greater efficiency on some metrics. The work offers a practical framework for evaluating and improving explainability in IR and provides a blueprint for applying similar human-centered evaluation in other ML/NLP domains.

Abstract

As information retrieval (IR) systems, such as search engines and conversational agents, become ubiquitous in various domains, the need for transparent and explainable systems grows to ensure accountability, fairness, and unbiased results. Despite recent advances in explainable AI and IR techniques, there is no consensus on the definition of explainability. Existing approaches often treat it as a singular notion, disregarding the multidimensional definition postulated in the literature. In this paper, we use psychometrics and crowdsourcing to identify human-centered factors of explainability in Web search systems and introduce SSE (Search System Explainability), an evaluation metric for explainable IR (XIR) search systems. In a crowdsourced user study, we demonstrate SSE's ability to distinguish between explainable and non-explainable systems, showing that higher scores indeed indicate greater interpretability. We hope that, aside from these concrete contributions to XIR, this line of work will serve as a blueprint for similar explainability evaluation efforts in other domains of machine learning and natural language processing.
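The summary above says that SSE is computed from factor loadings and item responses, but the paper's equation is not reproduced on this page. As a rough illustration only, the sketch below weights each questionnaire response by its item loading and rescales the total to at most 1. The function names, the 7-point scale, the example loadings, and the normalization are all assumptions made for this sketch, not the authors' definition; it also assumes positive loadings and items coded so that a higher response is better.

```python
from typing import List, Sequence


def item_scores(responses: Sequence[float], loadings: Sequence[float]) -> List[float]:
    """Per-item score: the response multiplied by that item's loading."""
    if len(responses) != len(loadings):
        raise ValueError("responses and loadings must have the same length")
    return [r * w for r, w in zip(responses, loadings)]


def weighted_score(
    responses: Sequence[float],
    loadings: Sequence[float],
    scale_max: float = 7.0,  # assumed Likert maximum, not taken from the paper
) -> float:
    """Hypothetical aggregate: loading-weighted mean response divided by scale_max.

    This is an illustrative normalization, not the SSE formula from the paper.
    """
    total_loading = sum(loadings)
    if total_loading <= 0:
        raise ValueError("loadings must sum to a positive value")
    return sum(item_scores(responses, loadings)) / (scale_max * total_loading)


if __name__ == "__main__":
    # Made-up responses and loadings for three questionnaire items.
    responses = [6, 5, 7]
    loadings = [0.8, 0.6, 0.7]
    print(item_scores(responses, loadings))
    print(round(weighted_score(responses, loadings), 3))
```

Weighting by loadings lets items that relate more strongly to a latent factor contribute more to the final score than a plain average of responses would.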

Paper Structure (21 sections, 1 equation, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Search interface shown to Group A (modeled on the system presented by Ramos and Eickhoff [ramos2020search]). On the left-hand side, stacked bar graphs depict hypothetical scores of each query keyword for the respective search result. The larger the stacked bar, the more relevant that result is to the query.
  • Figure 2: Path diagram for proposed structural equation model for modeling explainability.
  • Figure 3: Distribution of SSE scores. A Wilcoxon signed-rank test (T = 98.0, p < 0.001) indicates a statistically significant difference in scores between the two systems.
  • Figure 4: Loadings for individual questionnaire items. Items are labeled by the aspect they were originally assigned to and ordered by the between-system difference in score (response multiplied by item loading). Aspects with bars increasing toward the right indicate the practical usefulness of each system, while aspects with bars increasing toward the left represent areas requiring improvement to achieve full explainability.
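
The Figure 3 caption reports a Wilcoxon signed-rank statistic T. As a minimal sketch of how such a statistic is obtained from paired scores, the code below ranks the absolute paired differences (averaging ranks for ties, dropping zero differences) and returns the smaller of the positive and negative rank sums. The function name `wilcoxon_t` and the example scores are hypothetical, and how the study paired its observations is not stated in this section. The sketch computes only T, not the p-value; in practice a statistics library routine such as `scipy.stats.wilcoxon` would normally be used.

```python
from typing import List, Sequence


def wilcoxon_t(x: Sequence[float], y: Sequence[float]) -> float:
    """Wilcoxon signed-rank statistic T = min(W+, W-) for paired samples."""
    if len(x) != len(y):
        raise ValueError("x and y must be paired (same length)")

    # Paired differences; zero differences carry no sign and are dropped.
    diffs: List[float] = [a - b for a, b in zip(x, y) if a != b]

    # Indices of the differences, sorted by absolute value.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))

    # Assign 1-based ranks, giving tied absolute values their average rank.
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus)


if __name__ == "__main__":
    # Made-up paired scores for two systems; not data from the paper.
    system_a = [5, 7, 3, 9]
    system_b = [4, 4, 5, 9]
    print(wilcoxon_t(system_a, system_b))
```

A small T means that nearly all of the rank mass falls on one sign, that is, one system scores consistently higher than the other.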