An Evaluation of Explanation Methods for Black-Box Detectors of Machine-Generated Text
Loris Schoenegger, Yuxi Xia, Benjamin Roth
TL;DR
The paper tackles the challenge of explaining black-box detectors that distinguish machine-generated from human-written text. It systematically evaluates SHAP, LIME, and Anchor explanations for three detectors along the dimensions of faithfulness, stability, and usefulness, using automated tests (pointing game, token removal, continuity, contrastivity) and a user study. SHAP performs best in terms of faithfulness and stability, and it helps users most in predicting the detector's behavior. LIME, although users perceive it as the most useful method, yields the worst user performance at predicting detector behavior; Anchor sits in between with mixed results. The findings underscore the need to validate explanation methods beyond simple tasks and caution against assuming that perceived usefulness aligns with actual explanatory value. They point practitioners toward SHAP for this application while highlighting the importance of task-aware evaluation and UX considerations.
Abstract
The increasing difficulty of distinguishing language-model-generated from human-written text has led to the development of detectors of machine-generated text (MGT). However, in many contexts, a black-box prediction is not sufficient; it is equally important to know on what grounds a detector made that prediction. Explanation methods that estimate feature importance promise to indicate which parts of an input a classifier uses for its prediction. However, these methods are typically evaluated with simple classifiers and on tasks that are intuitive to humans. To assess their suitability beyond these contexts, this study conducts the first systematic evaluation of explanation quality for detectors of MGT. The dimensions of faithfulness and stability are evaluated with five automated experiments, and usefulness is assessed in a user study. We use a dataset of ChatGPT-generated and human-written documents, and pair predictions of three existing language-model-based detectors with the corresponding SHAP, LIME, and Anchor explanations. We find that SHAP performs best in terms of faithfulness, stability, and in helping users to predict the detector's behavior. In contrast, LIME, perceived as most useful by users, scores worst in terms of user performance at predicting detector behavior.
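The abstract does not spell out the experimental protocols. As a rough illustration of the idea behind a token-removal faithfulness check, the sketch below deletes the tokens an explanation rates as most important and measures how much the detector's score drops. The names `toy_detector` and `removal_drop`, the cue-word list, and the scoring rule are all hypothetical stand-ins chosen for this example; they are not the detectors, explanation methods, or metrics used in the study.

```python
from typing import Callable, List, Sequence

# Hypothetical cue words; a real detector is a language model, not a word list.
CUE_WORDS = {"furthermore", "overall", "additionally"}


def toy_detector(tokens: Sequence[str]) -> float:
    """Stand-in for a black-box detector: returns a score for 'machine-generated'."""
    hits = sum(1 for t in tokens if t.lower() in CUE_WORDS)
    return min(1.0, 0.25 + 0.25 * hits)


def removal_drop(
    detector: Callable[[Sequence[str]], float],
    tokens: Sequence[str],
    importances: Sequence[float],
    k: int,
) -> float:
    """Score drop after deleting the k tokens with the highest attributed importance."""
    ranked = sorted(range(len(tokens)), key=lambda i: importances[i], reverse=True)
    top = set(ranked[:k])
    reduced: List[str] = [t for i, t in enumerate(tokens) if i not in top]
    return detector(tokens) - detector(reduced)


tokens = "Overall the results are good and furthermore they are stable".split()

# An explanation that highlights the tokens the toy detector actually relies on ...
faithful = [1.0 if t.lower() in CUE_WORDS else 0.0 for t in tokens]
# ... and one that highlights unrelated tokens (the same scores, reversed).
unfaithful = faithful[::-1]

print("faithful drop:  ", removal_drop(toy_detector, tokens, faithful, k=2))
print("unfaithful drop:", removal_drop(toy_detector, tokens, unfaithful, k=2))
```

Under these assumptions, an explanation that points at the tokens the detector depends on produces a larger score drop than one that points elsewhere, which is the intuition that removal-based faithfulness tests rely on.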
