ReBeCA: Unveiling Interpretable Behavior Hierarchy behind the Iterative Self-Reflection of Language Models with Causal Analysis

Tianqiang Yan, Sihan Shang, Yuheng Li, Song Qiu, Hao Peng, Wenjian Luo, Jue Xie, Lizhen Qu, Yuan Gao

TL;DR

ReBeCA tackles the opaque mechanisms of language-model self-reflection by casting reflection dynamics as a causal, multi-round process analyzed with invariant causal prediction. It introduces Consistency-Enhanced Self-Refine to stabilize trajectories, encodes reflection behaviors as interpretable semantic patterns, and uses a three-stage ICP pipeline to discover, verify, and test sparse causal parents that drive final self-reflection outcomes. Across a case study on the Qwen3 family with MATH and BOUQuET tasks, ReBeCA uncovers a time-dependent behavioral hierarchy, demonstrates that most observed correlations are spurious, and reveals non-additive interactions among causal drivers. Intervention experiments on novel data confirm causal effects generalize out-of-distribution and highlight that focusing on a few causal patterns can improve performance, while stimulating multiple positives simultaneously may degrade results. Collectively, ReBeCA provides a principled framework for disentangling causal mechanisms in self-reflection and guiding robust improvements beyond empirical prompt engineering.

Abstract

While self-reflection can enhance language model reliability, its underlying mechanisms remain opaque, with existing analyses often yielding correlation-based insights that fail to generalize. To address this, we introduce ReBeCA (self-Reflection Behavior explained through Causal Analysis), a framework that unveils the interpretable behavioral hierarchy governing the self-reflection outcome. By modeling self-reflection trajectories as causal graphs, ReBeCA isolates genuine determinants of performance through a three-stage Invariant Causal Prediction (ICP) pipeline. We establish three critical findings: (1) Behavioral hierarchy: Semantic behaviors of the model influence final self-reflection results hierarchically: directly or indirectly; (2) Causation matters: Generalizability in self-reflection effects is limited to just a few semantic behaviors; (3) More $\neq$ better: The confluence of seemingly positive semantic behaviors, even among direct causal factors, can impair the efficacy of self-reflection. ICP-based verification identifies sparse causal parents achieving up to $49.6\%$ structural likelihood gains, stable across tasks where correlation-based patterns fail. Intervention studies on novel datasets confirm these causal relationships hold out-of-distribution ($p = .013$, $\eta^2_\mathrm{p} = .071$). ReBeCA thus provides a rigorous methodology for disentangling genuine causal mechanisms from spurious associations in self-reflection dynamics.
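
For readers unfamiliar with ICP, the invariance principle it rests on can be stated compactly. What follows is a sketch of the standard formulation from the general ICP literature, not the paper's own three-stage pipeline. The symbols $Y^e$, $X^e_S$, $\gamma$, $\varepsilon^e$, $\mathcal{E}$, and $\hat{S}$ are notation assumed here for illustration and may differ from the paper's. The last line recalls the usual definition of partial eta squared, the effect size quoted for the intervention study.

```latex
% Invariance hypothesis for a candidate parent set S
% (assumed notation: e indexes environments in \mathcal{E})
H_{0,S}:\quad \exists\,\gamma \ \text{such that}\quad
Y^{e} = X^{e}_{S}\,\gamma + \varepsilon^{e},
\qquad \varepsilon^{e} \sim F_{\varepsilon},
\qquad \varepsilon^{e} \perp\!\!\!\perp X^{e}_{S}
\qquad \text{for all } e \in \mathcal{E}.

% Estimated causal parents: the intersection of every set
% whose invariance hypothesis is not rejected
\hat{S} \;=\; \bigcap_{S \,:\, H_{0,S}\ \text{not rejected}} S.

% Partial eta squared (standard ANOVA effect size)
\eta^{2}_{\mathrm{p}}
  \;=\; \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}}.
```

Because $\hat{S}$ is an intersection, it tends to be small, which fits the abstract's description of sparse causal parents. Under the definition above, $\eta^2_\mathrm{p} = .071$ would mean the intervention accounts for roughly 7% of the variance that remains once other modeled effects are set aside.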

Paper Structure (28 sections, 4 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: ReBeCA is a novel causal-driven framework for analyzing the self-reflection of language models. Drawing on vast collections of self-reflection trajectories, it reveals the underlying dynamics of the model's semantic behaviors and pinpoints the direct causal drivers of the self-reflection outcome.
  • Figure 2: The flowchart of the Consistency-Enhanced Self-Refine (CESR).
  • Figure 3: Table \ref{tab:phase2_uniqueness} illustrated as a bar chart.