Semantic Faithfulness and Entropy Production Measures to Tame Your LLM Demons and Manage Hallucinations

Igor Halperin

TL;DR

This work tackles the challenge of evaluating LLM faithfulness to source contexts by introducing two unsupervised metrics: Semantic Faithfulness (SF), based on minimal KL divergence between topic-transition matrices under marginal constraints, and Semantic Entropy Production (SEP), grounded in stochastic thermodynamics. The authors model QCA triplets as distributions over a latent topic space and frame the LLM as a bipartite information engine akin to Maxwell's demon, enabling a principled, interpretable evaluation of information flow from context to answer. They develop efficient alternating-minimization algorithms to compute SF and SEP, provide a lower-bound derivation for SEP, and validate the framework on NVIDIA 10-K risk-factor triplets, showing SF captures semantic alignment while SEP captures thermodynamic efficiency and irreversibility, with SF and SEP offering complementary insights for hallucination control. The results suggest that higher SF correlates with stronger contextual grounding and that jointly reporting SF and SEP can improve automated faithfulness assessment, answer selection, and prompt design in high-stakes domains. The study lays groundwork for scalable, reference-free evaluation and motivates further exploration across larger datasets, retrieval-augmented generation, and multi-turn interactions.

Abstract

Evaluating faithfulness of Large Language Models (LLMs) to a given task is a complex challenge. We propose two new unsupervised metrics for faithfulness evaluation using insights from information theory and thermodynamics. Our approach treats an LLM as a bipartite information engine where hidden layers act as a Maxwell demon controlling transformations of context $C$ into answer $A$ via prompt $Q$. We model Question-Context-Answer (QCA) triplets as probability distributions over shared topics. Topic transformations from $C$ to $Q$ and $A$ are modeled as transition matrices ${\bf Q}$ and ${\bf A}$ encoding the query goal and actual result, respectively. Our semantic faithfulness (SF) metric quantifies faithfulness for any given QCA triplet by the Kullback-Leibler (KL) divergence between these matrices. Both matrices are inferred simultaneously via convex optimization of this KL divergence, and the final SF metric is obtained by mapping the minimal divergence onto the unit interval [0,1], where higher scores indicate greater faithfulness. Furthermore, we propose a thermodynamics-based semantic entropy production (SEP) metric in answer generation, and show that high faithfulness generally implies low entropy production. The SF and SEP metrics can be used jointly or separately for LLM evaluation and hallucination control. We demonstrate our framework on LLM summarization of corporate SEC 10-K filings.
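
As a rough illustration of the scoring step described in the abstract, the Python sketch below computes a context-weighted KL divergence between two row-stochastic topic-transition matrices and maps it onto the unit interval. It does not perform the joint convex inference of the matrices; both are simply taken as given. The function names, the toy matrices, the direction of the divergence, the weighting by $p(C)$, and the mapping $1/(1+D)$ are all assumptions made for illustration, not details confirmed by the abstract.

```python
import math


def kl_transition(A, Q, p_c, eps=1e-12):
    """Context-weighted KL divergence D(A || Q) between two row-stochastic
    matrices. ASSUMPTION: rows are indexed by context topics and weighted
    by the context distribution p_c; the paper may define this differently."""
    total = 0.0
    for weight, row_a, row_q in zip(p_c, A, Q):
        for a, q in zip(row_a, row_q):
            if a > 0.0:  # terms with a == 0 contribute nothing
                total += weight * a * math.log(a / max(q, eps))
    return total


def faithfulness_score(divergence):
    """Map a non-negative divergence onto (0, 1].
    ASSUMPTION: the form 1 / (1 + D) is a hypothetical choice."""
    return 1.0 / (1.0 + divergence)


# Toy inputs over three topics (hypothetical values).
p_c = [0.5, 0.3, 0.2]
Q = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
A_same = [row[:] for row in Q]  # answer follows the query goal exactly
A_drift = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]

score_same = faithfulness_score(kl_transition(A_same, Q, p_c))
score_drift = faithfulness_score(kl_transition(A_drift, Q, p_c))
print(score_same, score_drift)
```

In this toy setup, an answer matrix identical to the query matrix gets the maximal score, and a matrix that spreads probability mass away from the query goal scores lower.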

Paper Structure

This paper contains 42 sections, 25 equations, 6 figures, 1 table, 2 algorithms.

Figures (6)

  • Figure 1: Scatter plot of Semantic Faithfulness ($\mathcal{F}_S$) versus Semantic Entropy Production (SEP) for $n=100$ simulated QCA triplets. The solid red line shows the linear regression fit (SEP $= -1.76 \cdot \mathcal{F}_S + 2.02$), while the dashed green line shows the naive approximation SEP $= 1/\mathcal{F}_S - 1$ from Eq. (\ref{Sdot_F_S_naive}).
  • Figure 2: Scatter plot of QCA triplets in the question entropy-Semantic Faithfulness plane. The global fit across both groups produces positive Pearson correlation ($r = 0.695$, $p = 0.026$), indicating that higher question entropy is associated with higher semantic faithfulness. Group A (red) exhibits broader variation in both $H(Q)$ and $\mathcal{F}_S$, while Group B (blue) shows tighter clustering.
  • Figure 3: Relationship between Semantic Faithfulness $\mathcal{F}_S$ and Semantic Entropy Production (SEP). The global correlation is negative ($r = -0.612$, $p = 0.060$), consistent with the expectation that higher faithfulness corresponds to lower entropy production. Group A (red, comprehensive questions) shows strong within-group negative correlation ($r = -0.804$), while Group B (blue, competitive questions) shows no significant within-group correlation ($r = 0.121$).
  • Figure 4: Thermodynamic decomposition of SEP showing dissipated heat $\overset{\bm .}{S}_m$ versus system entropy change $\overset{\bm .}{S} = H(A) - H(C)$. Group A (red) exhibits higher system entropy change and wider variation in dissipated heat, while Group B (blue) clusters at lower $\overset{\bm .}{S}$ values. Negative $\overset{\bm .}{S}_m$ values indicate that the LLM draws on its internal knowledge base to offset entropy production during answer generation.
  • Figure 5: Probability distributions over semantic topics for triplet A0. Left: Question distribution $p(Q)$ is sparse, concentrated on a few semantic clusters. Center: Context distribution $p(C)$ is more diffuse, covering many topics from the source document. Right: Answer distribution $p(A)$ shows intermediate sparsity, reflecting how the LLM selectively addresses topics from the context to answer the question. These distributions serve as inputs to the SF and SEP algorithms.
  • ...and 1 more figure
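
The quantities named in the captions above can be made concrete with a small Python sketch. It computes the system entropy change $H(A) - H(C)$ from the Figure 4 caption and evaluates the two curves quoted in the Figure 1 caption, the linear fit and the naive approximation. The function names and toy distributions are hypothetical, natural-log (nats) entropy is an assumption, and the fit coefficients are copied from the caption for the simulated triplets, so they should not be read as general constants.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in nats (ASSUMPTION: the paper's log base is not
    stated in the captions)."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def system_entropy_change(p_a, p_c):
    """System entropy change H(A) - H(C), as written in the Figure 4 caption."""
    return shannon_entropy(p_a) - shannon_entropy(p_c)


def sep_linear_fit(f_s):
    """Linear regression fit quoted in the Figure 1 caption (simulated data)."""
    return -1.76 * f_s + 2.02


def sep_naive(f_s):
    """Naive approximation SEP = 1 / F_S - 1 quoted in the Figure 1 caption."""
    return 1.0 / f_s - 1.0


# Toy topic distributions (hypothetical): a diffuse context, a sparser answer.
p_context = [0.25, 0.25, 0.25, 0.25]
p_answer = [0.7, 0.1, 0.1, 0.1]

delta_s = system_entropy_change(p_answer, p_context)
print(delta_s)
for f_s in (0.5, 0.75, 1.0):
    print(f_s, sep_linear_fit(f_s), sep_naive(f_s))
```

Under these assumptions the naive curve vanishes at $\mathcal{F}_S = 1$, whereas the linear fit stays positive there, which is one way to see that the two lines in Figure 1 describe different relationships near perfect faithfulness.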