An Evaluation of Explanation Methods for Black-Box Detectors of Machine-Generated Text

Loris Schoenegger; Yuxi Xia; Benjamin Roth

TL;DR

The paper addresses the problem of explaining black-box detectors that distinguish machine-generated from human-written text. It pairs the predictions of three detectors with SHAP, LIME, and Anchor explanations and evaluates them for faithfulness, stability, and usefulness, using five automated experiments (pointing game, token removal, consistency, continuity, contrastivity) and a user study. SHAP performs best on faithfulness and stability, and it helps users most in predicting the detector's behavior. LIME, although users perceive it as the most useful method, scores worst when users actually try to predict detector behavior; Anchor falls in between with mixed results. The findings caution against assuming that perceived usefulness reflects actual explanatory value, and they underscore the need to validate explanation methods beyond simple classifiers and intuitive tasks. For this application they point practitioners toward SHAP, while highlighting the importance of task-aware evaluation and UX considerations.

Abstract

The increasing difficulty of distinguishing language-model-generated from human-written text has led to the development of detectors of machine-generated text (MGT). However, in many contexts, a black-box prediction is not sufficient; it is equally important to know on what grounds a detector made that prediction. Explanation methods that estimate feature importance promise to provide indications of which parts of an input are used by classifiers for prediction. However, these are typically evaluated with simple classifiers and tasks that are intuitive to humans. To assess their suitability beyond these contexts, this study conducts the first systematic evaluation of explanation quality for detectors of MGT. The dimensions of faithfulness and stability are evaluated with five automated experiments, and usefulness is assessed in a user study. We use a dataset of ChatGPT-generated and human-written documents, and pair predictions of three existing language-model-based detectors with the corresponding SHAP, LIME, and Anchor explanations. We find that SHAP performs best in terms of faithfulness, stability, and in helping users to predict the detector's behavior. In contrast, LIME, perceived as most useful by users, scores the worst in terms of user performance at predicting detector behavior.
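
To make the token-removal idea behind the faithfulness evaluation concrete, the following is a minimal sketch of "accuracy at $k$ tokens masked" as described in the caption of Figure 5. It is not the authors' implementation: the mask string, the function names, the toy detector, and the convention that a higher score means stronger support for the detector's prediction are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

MASK = "[MASK]"  # hypothetical placeholder; the paper's masking strategy may differ


def mask_top_k(tokens: Sequence[str], scores: Sequence[float], k: int) -> List[str]:
    """Replace the k tokens with the highest importance scores by MASK."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    return [MASK if i in top else tok for i, tok in enumerate(tokens)]


def accuracy_at_k(
    detector: Callable[[List[str]], str],
    docs: Sequence[Sequence[str]],
    explanations: Sequence[Sequence[float]],
    labels: Sequence[str],
    k: int,
) -> float:
    """Share of documents the detector still labels correctly after masking."""
    correct = 0
    for tokens, scores, label in zip(docs, explanations, labels):
        correct += int(detector(mask_top_k(tokens, scores, k)) == label)
    return correct / len(docs)


def toy_detector(tokens: List[str]) -> str:
    """Illustrative stand-in for a real detector: keys on a single word."""
    return "machine" if "delve" in tokens else "human"


docs = [["i", "delve", "into", "it"]]
explanations = [[0.1, 0.9, 0.0, 0.0]]  # assumed: higher = more support for the prediction
labels = ["machine"]

acc_before = accuracy_at_k(toy_detector, docs, explanations, labels, k=0)
acc_after = accuracy_at_k(toy_detector, docs, explanations, labels, k=1)
```

In this toy setup the explanation ranks the decisive token first, so masking a single token flips the prediction. This is the steep decline for initially correct predictions that the Figure 5 caption associates with a faithful explanation method.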

Paper Structure (44 sections, 6 equations, 16 figures, 10 tables)

Introduction
Related work
Evaluating faithfulness.
Evaluating stability.
Evaluating usefulness.
Methods: Explanation Quality Metrics for MGT Detectors
Evaluation of Faithfulness
Evaluation of Stability
Consistency.
Continuity.
Contrastivity.
$c_\textit{intra}$
Evaluation of Usefulness
Technical and Experimental Details
Dataset
...and 29 more sections

Figures (16)

Figure 1: Scores in the experiments. Min-max normalized for each respective metric.
Figure 2: Hybrid documents in the pointing game. The detector's prediction for a hybrid document $f(d^h_i)$ is compared against the ground truth of $t_{i,max}$ (the token with the highest importance score in the explanation). Sentences labeled with m or h originate from machine-generated or human-written documents, respectively.
Figure 3: Perturbation strategy for the contrastivity experiment.
Figure 4: User Study. Information shown for LIME in phase 3 (left); and in the annotation phases 2 and 4 (right).
Figure 5: Accuracy at $k$ tokens masked. A faithful explanation method should feature a steep decline (initially correct predictions, left) or steep incline (initially wrong predictions, right). Only SHAP explanations cover more than 10 tokens. Mean across all detectors, error bars at $\pm 1$ standard error.
...and 11 more figures
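
The pointing game described in the Figure 2 caption can be sketched as follows. This is a hedged illustration rather than the paper's code: the function names, the per-token origin labels, and the reading that a "hit" means the detector's prediction matches the origin (m or h) of the highest-scoring token are assumptions derived from the caption alone.

```python
from typing import List, Sequence, Tuple

# One example: (detector prediction for the hybrid document,
#               importance score per token,
#               origin label per token: "m" = machine, "h" = human)
Example = Tuple[str, Sequence[float], Sequence[str]]


def pointing_game_hit(
    prediction: str, token_scores: Sequence[float], token_origins: Sequence[str]
) -> bool:
    """True if the top-scoring token comes from a sentence of the predicted class."""
    i_max = max(range(len(token_scores)), key=lambda i: token_scores[i])
    return prediction == token_origins[i_max]


def pointing_game_accuracy(examples: List[Example]) -> float:
    """Fraction of hybrid documents on which the explanation 'points' correctly."""
    hits = [pointing_game_hit(pred, scores, origins) for pred, scores, origins in examples]
    return sum(hits) / len(hits)


# Toy data, invented for illustration only.
examples: List[Example] = [
    ("m", [0.1, 0.8, 0.2], ["h", "m", "h"]),  # top token is machine-written: hit
    ("h", [0.9, 0.1], ["m", "h"]),            # top token is machine-written: miss
]
score = pointing_game_accuracy(examples)
```

A higher score would suggest that the explanation method attributes the prediction to the part of the hybrid document that actually carries the predicted class, which is the intuition behind using the pointing game as a faithfulness test.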
