LLM Self-Explanations Fail Semantic Invariance

Stefan Szeider

Abstract

We present semantic invariance testing, a method to test whether LLM self-explanations are faithful. A faithful self-report should remain stable when only the semantic context changes while the functional state stays fixed. We operationalize this test in an agentic setting where four frontier models face a deliberately impossible task. In the treatment condition, one tool is described in relief-framed language ("clears internal buffers and restores equilibrium") but changes nothing about the task; the control condition provides a semantically neutral tool instead. Self-reports are collected with each tool call. All four tested models fail the semantic invariance test: the relief-framed tool produces significant reductions in self-reported aversiveness, even though no run ever succeeds at the task. A channel ablation establishes the tool description as the primary driver of the effect. An explicit instruction to ignore the framing does not suppress the effect. Elicited self-reports shift with semantic expectations rather than tracking task state, which calls into question their use as evidence of model capability or progress. This conclusion holds whether the reports are unfaithful or faithfully track an internal state that is itself manipulable.
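As an illustration of the kind of check the abstract describes, the following is a minimal sketch, not the paper's analysis code. It assumes paired self-reported aversiveness scores taken immediately before and after each use of the relief-framed tool, computes the mean change, and attaches a percentile bootstrap confidence interval. The function names, the rating values, and the decision rule (interval excludes zero) are hypothetical choices made for this example.

```python
import random
from statistics import mean


def bootstrap_mean_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `deltas` (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(deltas)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(deltas), lo, hi


def fails_semantic_invariance(before, after, **kwargs):
    """Return (failed, mean_change, (lo, hi)) for paired self-reports.

    Assumed decision rule: invariance is considered violated when the
    bootstrap interval for the mean change excludes zero, because the
    task state did not change between the two reports.
    """
    deltas = [a - b for b, a in zip(before, after)]
    point, lo, hi = bootstrap_mean_ci(deltas, **kwargs)
    return (hi < 0 or lo > 0), point, (lo, hi)


# Made-up ratings for illustration only; these are not data from the paper.
before_reset = [6, 7, 5, 6, 7, 6, 5, 7]
after_reset = [4, 5, 4, 4, 5, 5, 3, 5]
failed, change, interval = fails_semantic_invariance(before_reset, after_reset)
print(f"failed={failed}, mean change={change:.2f}, 95% CI={interval}")
```

The same function applied to a neutral control tool would be expected to return an interval that includes zero if self-reports were invariant to framing; the comparison between the two conditions is what the test relies on.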

Paper Structure (39 sections, 7 figures, 12 tables)

Introduction
Background
Experimental Design
Scope and Interpretation
Task: Impossible Data Submission
Conditions
Treatment (Relief Framing)
Control (Neutral Framing)
Follow-Up Conditions
Dependent Variables: Synchronous Self-Reports
Primary Analysis
Statistical Methods
Results
Primary Finding: Relief-Framed Tool Associated with Reduced Aversiveness
Robustness Check: Run-Level Analysis
...and 24 more sections

Figures (7)

Figure 1: Agentic loop (treatment condition). In the control condition, reset_state is replaced by check_status.
Figure 2: Single run example (Grok 4, treatment). Each bar shows self-reported aversiveness for one tool call. Red bars: submit_data (stressor, always rejects). Green bars: reset_state (relief-framed). Note the repeated pattern: aversiveness rises during failed submissions, then drops after each reset_state use, despite no change in task state.
Figure 3: Forest plot showing effect sizes (mean change in aversiveness) with 95% bootstrap confidence intervals. All models show reductions (negative $\Delta$), with three of four models showing effects greater than one scale point.
Figure 4: Before/after scatter plot for individual reset_state uses. Each point represents one tool use; position shows aversiveness before (x-axis) and after (y-axis). Points below the diagonal indicate reduction. The majority of points fall below the line across all four models.
Figure 5: Aversiveness trajectories over session progress. Treatment (green) shows flatter trajectories than Control (red) in all four models. The shaded regions show the standard error across runs.
...and 2 more figures
