Latent Debate: A Surrogate Framework for Interpreting LLM Thinking

Lihu Chen, Xiang Yin, Francesca Toni

TL;DR

This work introduces latent debate, a model-agnostic framework that interprets LLM thinking through three components: internal latent arguments, an argument interpreter, and a symbolic thinking module based on Quantitative Bipolar Argumentation Frameworks (QBAFs). A symbolic instantiation for True/False prediction yields a fast, training-free surrogate whose predictions are highly consistent with those of the original models. Latent debate also supports hallucination detection: debate-pattern features are extracted from the surrogate, and SHAP analyses link internal conflicts, especially debates in the middle layers, to hallucinations. The approach offers an interpretable, structured view of LLM reasoning, a practical baseline for hallucination monitoring, and a starting point for diagnosing errors driven by internal disagreement.

Abstract

Understanding the internal thinking process of Large Language Models (LLMs) and the cause of hallucinations remains a key challenge. To this end, we introduce latent debate, a novel framework for interpreting model predictions through the lens of implicit internal arguments. Unlike existing work on self-consistency and multi-agent debate, which relies on explicit debates among multiple answers or multiple models, latent debate captures the hidden supporting and attacking signals that arise within a single model during a single inference. We first present a model- and task-agnostic conceptual framework, and then instantiate it symbolically to approximate the thinking process of LLMs on True/False prediction tasks. Empirical studies demonstrate that latent debate is a faithful structured surrogate model whose predictions are highly consistent with those of the original LLM. Beyond interpretability, we demonstrate that latent debate provides a strong baseline for hallucination detection. Further analysis reveals strong correlations between hallucinations and debate patterns; for example, a high degree of latent debate in the middle layers is linked to a higher risk of hallucination. These findings position latent debate as a potential framework for understanding the internal mechanisms of LLMs, especially in scenarios where internal (dis)agreements appear during the inference steps.

Paper Structure

This paper contains 36 sections, 10 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Visualizations of our latent debate for two claims (using the last few tokens of Llama-8B). Red cells represent attacking arguments, while blue cells represent supporting arguments. More controversy leads to hallucination.
  • Figure 2: The overall framework of latent debate. Given an input claim, our method generates a set of latent arguments, i.e., model components (raw latent signals) that convey the model's opinions toward the claim. These arguments are then processed by the argument interpreter, identifying the arguments' supporting or attacking stance towards the claim. The resulting attacking and supporting arguments are fed into the thinking module, which applies a procedure to reach the final decision.
  • Figure 3: QBAFs and LLMs. (a) An example QBAF showing how the initial strengths ($\tau$) of arguments $n_1, \ldots, n_4$ are transformed through gradual semantics, based on attacking (-) and supporting (+) relations, to produce final strengths ($\sigma$). (b) Skeleton of a QBAF drawn from an LLM architecture, where each node represents a specific token at a specific layer. To obtain a QBAF, each directed edge must be labeled as an attack or a support, and each node must be assigned an initial strength.
  • Figure 4: (a) Average feature importance highlights which debate features most strongly influence hallucinated outputs. (b) Feature importance across layer regions (Lower, Middle, Upper) and feature types (NumAtk = number of attacks, VarFin = variance of final strength, AvgFin = average of final strength, VarInit = variance of initial strength, AvgInit = average of initial strength).
  • Figure A1: 3D surface plot of the semantics function $\sigma$.
  • ...and 1 more figure
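
The debate features named in the Figure 4 caption (NumAtk, AvgInit, VarInit, AvgFin, VarFin, grouped by Lower, Middle, and Upper layer regions) can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the argument records (dicts with `layer`, `stance`, `initial`, `final` keys), the equal-thirds split of layers into regions, and the function names `region_of` and `debate_features` are all hypothetical.

```python
from statistics import mean, pvariance


def region_of(layer, num_layers):
    """Map a 0-based layer index to a region (assumed equal-thirds split)."""
    third = num_layers / 3
    if layer < third:
        return "Lower"
    if layer < 2 * third:
        return "Middle"
    return "Upper"


def debate_features(arguments, num_layers):
    """Summarize arguments per layer region.

    Each argument is a hypothetical record:
    {"layer": int, "stance": "attack" | "support",
     "initial": float, "final": float}
    """
    by_region = {"Lower": [], "Middle": [], "Upper": []}
    for arg in arguments:
        by_region[region_of(arg["layer"], num_layers)].append(arg)

    features = {}
    for region, args in by_region.items():
        init = [a["initial"] for a in args]
        fin = [a["final"] for a in args]
        features[region] = {
            # Number of attacking arguments in this region.
            "NumAtk": sum(1 for a in args if a["stance"] == "attack"),
            # Mean and population variance of the strengths;
            # empty regions default to 0.0 (an assumption of this sketch).
            "AvgInit": mean(init) if init else 0.0,
            "VarInit": pvariance(init) if init else 0.0,
            "AvgFin": mean(fin) if fin else 0.0,
            "VarFin": pvariance(fin) if fin else 0.0,
        }
    return features
```

Feature dictionaries of this kind could then be flattened into a vector for a downstream hallucination classifier; the choice of classifier and the SHAP analysis are outside the scope of this sketch.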

Theorems & Definitions (3)

  • Definition 1: QBAF
  • Proof 1
  • Example 1
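
To make the QBAF of Definition 1 concrete, the sketch below computes final strengths ($\sigma$) from initial strengths ($\tau$) on an acyclic graph of attacks and supports. It uses DF-QuAD, one well-known gradual semantics from the argumentation literature, as a stand-in; this listing does not state which semantics the paper adopts, so the choice is an assumption. The function name `dfquad_strengths` and the dictionary-based graph encoding are likewise hypothetical.

```python
from math import prod


def dfquad_strengths(tau, attackers, supporters):
    """Final strengths of an acyclic QBAF under DF-QuAD semantics.

    tau:        dict mapping each argument to its initial strength in [0, 1]
    attackers:  dict mapping an argument to the list of arguments attacking it
    supporters: dict mapping an argument to the list of arguments supporting it
    """
    final = {}

    def aggregate(values):
        # Probabilistic sum: 0.0 for no arguments, approaches 1.0 as
        # strong arguments accumulate.
        return 1.0 - prod(1.0 - v for v in values)

    def sigma(node):
        if node not in final:
            va = aggregate([sigma(a) for a in attackers.get(node, [])])
            vs = aggregate([sigma(s) for s in supporters.get(node, [])])
            base = tau[node]
            if va >= vs:
                # Attacks dominate: pull the strength toward 0.
                final[node] = base - base * (va - vs)
            else:
                # Supports dominate: push the strength toward 1.
                final[node] = base + (1.0 - base) * (vs - va)
        return final[node]

    for node in tau:
        sigma(node)
    return final


# A claim with one attacker (strength 0.8) and one supporter (strength 0.4).
tau = {"claim": 0.5, "a": 0.8, "s": 0.4}
strengths = dfquad_strengths(
    tau, attackers={"claim": ["a"]}, supporters={"claim": ["s"]}
)
```

In this example the attacker outweighs the supporter, so the claim's final strength falls below its initial strength of 0.5, while the two leaf arguments keep their initial strengths. The recursion assumes the graph has no cycles.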