Fewer Than 1% of Explainable AI Papers Validate Explainability with Humans

Ashley Suh, Isabelle Hurley, Nora Smith, Ho Chit Siu

TL;DR

Explainability claims in AI are widespread but rarely validated with human users. The authors perform a librarian-assisted, large-scale Scopus audit of 18,254 XAI papers, identifying 253 with human-related claims and 128 with human studies, using a reproducible scoring scheme. They find that only $0.7\%$ of XAI papers provide empirical evidence from human evaluations, highlighting a major gap between claims and validation. The work urges a shift toward mandatory, human-centered evaluation in XAI and provides a transparent methodology to enable replication and further investigation.

Abstract

This late-breaking work presents a large-scale analysis of explainable AI (XAI) literature to evaluate claims of human explainability. We collaborated with a professional librarian to identify 18,254 papers containing keywords related to explainability and interpretability. Of these, we find that only 253 papers included terms suggesting human involvement in evaluating an XAI technique, and just 128 of those conducted some form of a human study. In other words, fewer than 1% of XAI papers (0.7%) provide empirical evidence of human explainability when compared to the broader body of XAI literature. Our findings underscore a critical gap between claims of human explainability and evidence-based validation, raising concerns about the rigor of XAI research. We call for increased emphasis on human evaluations in XAI studies and provide our literature search methodology to enable both reproducibility and further investigation into this widespread issue.
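The headline percentage can be checked directly from the counts quoted in the abstract. The sketch below is only an arithmetic check, not the authors' analysis code; the variable and function names are illustrative, and it assumes the 0.7% figure is the 128 human-study papers divided by all 18,254 papers returned by the search.

```python
# Counts as reported in the abstract.
TOTAL_XAI_PAPERS = 18_254    # papers matching explainability/interpretability keywords
HUMAN_CLAIM_PAPERS = 253     # papers with terms suggesting human involvement
HUMAN_STUDY_PAPERS = 128     # papers that conducted some form of human study


def share(part: int, whole: int) -> float:
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


# Assumed denominators: the full search result set for the first two,
# and the human-claim subset for the third.
validated_pct = share(HUMAN_STUDY_PAPERS, TOTAL_XAI_PAPERS)
claims_pct = share(HUMAN_CLAIM_PAPERS, TOTAL_XAI_PAPERS)
study_of_claims_pct = share(HUMAN_STUDY_PAPERS, HUMAN_CLAIM_PAPERS)

print(f"Human studies / all XAI papers:   {validated_pct:.1f}%")        # 0.7%
print(f"Human claims / all XAI papers:    {claims_pct:.1f}%")           # 1.4%
print(f"Human studies / human-claim set:  {study_of_claims_pct:.1f}%")  # 50.6%
```

Under these assumptions, roughly half of the papers that mention human involvement actually ran a human study, and that group is about 0.7% of the full search result set.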

Paper Structure

This paper contains 11 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Distribution comparing all XAI literature from our Scopus search (Table \ref{tab:scopus-search}), including the papers we scored. 'All XAI papers' is the superset of all papers with keywords related to explainability, interpretability, etc. 'Claims about humans' papers are the subset filtered on keywords related to human explainability. 'On topic' papers are the subset of those that remain after we excluded meta reviews, surveys, etc. 'Validated' papers are the subset of those that provided empirical evidence of explainability. The right panel shows a zoomed-in version of the last three bars of the left panel; note the change in the x-axis.
  • Figure 2: Distribution of the number of human subjects involved in an evaluation, experiment, interview, etc. for validated XAI papers. Note that 13 papers did not report their human subject count and 3 papers approximated their count. The bottom figure shows a subset of the distribution, with counts up to N=152 and a bin size of 5.