Spilled Energy in Large Language Models

Adrian Robert Minut, Hazem Dewidar, Iacopo Masi

TL;DR

This work reinterprets the final Large Language Model (LLM) softmax classifier as an Energy-Based Model (EBM), decomposing the sequence-to-sequence probability chain into multiple interacting EBMs at inference, and introduces two completely training-free metrics derived directly from output logits: spilled energy and marginalized energy.

Abstract

We reinterpret the final Large Language Model (LLM) softmax classifier as an Energy-Based Model (EBM), decomposing the sequence-to-sequence probability chain into multiple interacting EBMs at inference. This principled approach allows us to track "energy spills" during decoding, which we empirically show correlate with factual errors, biases, and failures. Similar to Orgad et al. (2025), our method localizes the exact answer token and subsequently tests for hallucinations. Crucially, however, we achieve this without requiring trained probe classifiers or activation ablations. Instead, we introduce two completely training-free metrics derived directly from output logits: spilled energy, which captures the discrepancy between energy values across consecutive generation steps that should theoretically match, and marginalized energy, which is measurable at a single step. Evaluated on nine benchmarks across state-of-the-art LLMs (including LLaMA, Mistral, and Gemma) and on synthetic algebraic operations (Qwen3), our approach demonstrates robust, competitive hallucination detection and cross-task generalization. Notably, these results hold for both pretrained and instruction-tuned variants without introducing any training overhead.
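As a rough illustration of how such logit-only metrics could be read off a model, the sketch below computes a marginalized energy and a spilled-energy-style discrepancy with NumPy. It is a minimal sketch under assumptions, not the paper's definition: it uses the common energy-based reading of a softmax classifier, where the energy of a token is its negative logit and the marginalized energy of a step is the negative log-sum-exp of that step's logits. The function names, the array layout, and the sign convention of the difference are all hypothetical; the precise form is given by the paper's Definition 4.1.

```python
import numpy as np


def logsumexp(z, axis=-1):
    """Numerically stable log(sum(exp(z))) along `axis`."""
    m = np.max(z, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(z - m), axis=axis))


def marginalized_energy(logits):
    """Assumed marginalized energy: -logsumexp over the vocabulary.

    logits: array of shape (T, V); row t holds the logits emitted after
    reading tokens 0..t. Each value depends on a single step only.
    """
    return -logsumexp(np.asarray(logits, dtype=float), axis=-1)


def spilled_energy(logits, token_ids):
    """Assumed spilled-energy-style discrepancy for positions 1..T-1.

    For position t, compare two quantities read at consecutive steps:
      * the negative logit that step t-1 assigned to the token at t, and
      * the marginalized energy of step t (context now includes token t).
    The sign convention (token energy minus marginalized energy) is an
    assumption made for this sketch.
    """
    logits = np.asarray(logits, dtype=float)
    token_ids = np.asarray(token_ids)
    T = logits.shape[0]
    token_energy = -logits[np.arange(T - 1), token_ids[1:]]
    return token_energy - marginalized_energy(logits[1:])


# Toy example: 3 steps, vocabulary of 4 tokens (values are made up).
toy_logits = np.array([[0.0, 0.0, 0.0, 0.0],
                       [1.0, 2.0, 3.0, 4.0],
                       [0.0, 0.0, 0.0, 0.0]])
toy_tokens = [0, 2, 1]
print(spilled_energy(toy_logits, toy_tokens))
```

Both functions only read logits, so no probe training or activation ablation is involved. In a real pipeline the per-token values would be restricted to the exact answer tokens and pooled (the figure captions mention min pooling) before any thresholding.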

Paper Structure

This paper contains 32 sections, 23 equations, 11 figures, and 5 tables.

Figures (11)

  • Figure 1: Color-coded comparison of hallucination detection with LLaMA-3 8B using logit confidence and spilled energy. Our method generalizes well across topics (e.g., Q&A, reasoning) and diverse LLMs. ✓ indicates a correct answer and ✗ an incorrect one. While our approach focuses on the exact answer tokens (e.g., Rome/Sydney and 120/470, see \ref{sec:detect-hallucinations}), here we apply min–max normalization to the full answer for visualization, with colors ranging from truthful to hallucination.
  • Figure 2: How energy spills in LLMs. (a) Language modeling $p(\mathbf{x}_{i:1})$ is decomposed with the chain rule of probability and implemented autoregressively: we recursively apply a discriminative classifier over the vocabulary $\mathcal{V}$ to attain generative modeling with a growing context, i.e., $p(\mathbf{x}_{i}|\mathbf{x}_{i-1:1})$. (b) We reinterpret each discriminative classifier as a generative EBM, finding a connection between two quantities that should be the same across time steps yet are different. We call this difference "the spilled energy" $\Delta E_{\boldsymbol{\theta}}(\mathbf{x}_{i:1})$ in \ref{def:spilled-energy}. (c) Because we simply read values inside the LLM, our approach is training-free, and it correlates well with hallucinations on a synthetic math dataset of increasing difficulty. (d) Histograms of spilled energy values for incorrect and correct answers on all nine datasets, using $\min$ pooling for LLaMA-3-Instruct. The two distributions are easily separable with a simple threshold, which yields generalization across real-world tasks.
  • Figure 3: Histograms of Spilled Energy values across models (rows) on Math Sums with different error ranges in the answer (columns, decreasing range left to right, making it harder to detect errors). All sums are performed on 13-digit integers. In the fourth column, we show ROC curves for Hallucination Detection across the error ranges (colors) and methods (line styles).
  • Figure 4: (a) AuROC performance, as percentages, of the probing classifiers of Orgad et al. (2025) on exact answer tokens for LLaMA-3-Instruct. (b) Performance difference between our Spilled $\Delta E$ with min pooling and theirs. Positive values indicate cases where Spilled $\Delta E$ outperforms Orgad et al. (2025). This comparison highlights the generalization capabilities of our method compared to probing classifiers. Legend: colors range from low performance to high performance.
  • Figure 5: Histograms of Spilled Energy values across models (rows) on Math Sums with different error ranges in the answer (columns, decreasing range left to right, making it harder to detect errors), as described in \ref{sec:synth-arithmetic}. In the fourth column, we show ROC curves for Hallucination Detection across the error ranges (colors) and methods (line styles).
  • ...and 6 more figures
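
For reference, the chain-rule decomposition mentioned in the Figure 2 caption can be written out as follows. This is the standard autoregressive factorization in the caption's notation; the symbol $f_{\boldsymbol{\theta}}$ for the logit vector and the index $k$ are introduced here for illustration and may not match the paper's own symbols.

```latex
% Chain rule over a sequence of i tokens; for k = 1 the context is empty.
p(\mathbf{x}_{i:1}) = \prod_{k=1}^{i} p(\mathbf{x}_{k} \mid \mathbf{x}_{k-1:1}),
\qquad
p(\mathbf{x}_{k} \mid \mathbf{x}_{k-1:1})
  = \frac{\exp\big(f_{\boldsymbol{\theta}}(\mathbf{x}_{k-1:1})[\mathbf{x}_{k}]\big)}
         {\sum_{v \in \mathcal{V}} \exp\big(f_{\boldsymbol{\theta}}(\mathbf{x}_{k-1:1})[v]\big)}
```

Each factor is one softmax classifier over the vocabulary $\mathcal{V}$, which is the object the paper reinterprets as an EBM.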

Theorems & Definitions (1)

  • Definition 4.1: Spilled Energy $\Delta E_{\boldsymbol{\theta}}(\mathbf{x}_{i:1})$