The Trilemma of Truth in Large Language Models

Germans Savcisens, Tina Eliassi-Rad

TL;DR

This work questions how truth is encoded and retrieved in large language models and demonstrates that common probing methods yield unreliable veracity signals. It introduces sAwMIL, a sparse-aware multiple-instance learning (MIL) framework combined with conformal prediction, which classifies statements as true, false, or neither using the internal activations of every token in a statement. Across 16 open-source LLMs and three curated datasets, sAwMIL outperforms zero-shot prompting and prior probes. The results indicate that truth and falsehood are not simply opposite directions but occupy a shared, low-dimensional subspace, with a distinct "neither" signal indicating undefined veracity. The approach enables calibrated uncertainty estimates and transferable veracity directions, offering a more reliable foundation for interpreting LLM knowledge and guiding user interactions in real-world applications.

Abstract

The public often attributes human-like qualities to large language models (LLMs) and assumes they "know" certain things. In reality, LLMs encode information retained during training as internal probabilistic knowledge. This study examines existing methods for probing the veracity of that knowledge and identifies several flawed underlying assumptions. To address these flaws, we introduce sAwMIL (Sparse-Aware Multiple-Instance Learning), a multiclass probing framework that combines multiple-instance learning with conformal prediction. sAwMIL leverages internal activations of LLMs to classify statements as true, false, or neither. We evaluate sAwMIL across 16 open-source LLMs, including default and chat-based variants, on three new curated datasets. Our results show that (1) common probing methods fail to provide a reliable and transferable veracity direction and, in some settings, perform worse than zero-shot prompting; (2) truth and falsehood are not encoded symmetrically; and (3) LLMs encode a third type of signal that is distinct from both true and false.
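
The conformal-prediction component mentioned in the abstract can be illustrated with generic split conformal prediction over the three classes. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the nonconformity score (one minus the probability assigned to the correct class), and the toy probabilities are all hypothetical.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Score threshold from a held-out calibration set (split conformal prediction)."""
    n = len(cal_labels)
    # Nonconformity score: one minus the probability of the correct class.
    scores = np.sort(1.0 - cal_probs[np.arange(n), cal_labels])
    # Finite-sample corrected rank of the (1 - alpha) quantile.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    # With too few calibration points, fall back to the trivial threshold
    # that keeps every class in the prediction set.
    return 1.0 if k > n else float(scores[k - 1])

def prediction_sets(probs, threshold):
    """Boolean mask (n_statements, n_classes): True where a class stays in the set."""
    return (1.0 - probs) <= threshold

# Toy probe outputs (assumed values); columns are 0 = true, 1 = false, 2 = neither.
cal_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.3, 0.4],
                      [0.6, 0.3, 0.1]])
cal_labels = np.array([0, 1, 2, 0])

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
new_probs = np.array([[0.7, 0.2, 0.1],
                      [0.45, 0.45, 0.1]])
print(prediction_sets(new_probs, threshold))
```

A prediction set with exactly one class is a confident call; an empty or multi-class set signals that the probe's output should not be trusted for that statement.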

Paper Structure

This paper contains 40 sections, 23 equations, 14 figures, 17 tables, 6 algorithms.

Figures (14)

  • Figure 1: Overview of methods for probing veracity in LLMs. (A) In zero-shot prompting, a target statement is inserted into a structured prompt instructing the LLM to select an answer from a specific set of tokens. The LLM's prediction is based on the probabilities of these tokens. This method treats the model as a black box and examines its $\langle$input, output$\rangle$ pairs. (B) In representation-based probing, the analysis is done on the internal representations generated by intermediate decoders. (B1) The mean-difference probe (Marks and Tegmark, 2023) is a common method for determining the veracity of a statement based on the representation of the last token. This approach outputs probabilities for true or false statements, but cannot account for statements that lack a definitive truth value. (B2) Our probe, multiclass sparse-aware MIL (sAwMIL), looks at the representation of every token in a statement and provides probabilities for three classes: true, false, and neither. Multiclass sAwMIL can account for cases when the LLM does not have any knowledge about the statement.
  • Figure 2: Mean performance of probing methods aggregated across 16 models and three datasets, along with example confusion matrices. Each probe is evaluated under two settings: using only the representation of the last token and using predictions aggregated across the entire bag/sentence. For each probe, we aggregate the statistics using the best-performing layers. (A) Correlation criterion. Mean probe performance on the test split of the dataset on which each probe was trained. (B) Generalization criterion. Mean probe performance on datasets that were not used for training. Binary probes trained on the last-token representation (MD+CP, TTPD+CP, and sPCA+CP) exhibit both lower performance and poorer generalization. The multiclass SVM+CP (also trained on the last-token representation) achieves much better performance in A; however, its performance degrades when evaluated on the full bag. Importantly, SVM+CP generalizes worse than sAwMIL. (C) Confusion matrix for zero-shot prompting, where the LLM overpredicts the false class. (D) Confusion matrix for MD+CP evaluated on the last-token representation, where the probe misclassifies neither-valued statements. (E) Confusion matrix for the multiclass sAwMIL evaluated on the full bag. This probe achieves better accuracy on the neither-valued statements.
  • Figure 3: Similarity between is-true and is-false directions across datasets and probes (values extracted from the best-performing layer and averaged across 16 LLMs). (A) Cosine similarity between the two directions. If the two directions were perfectly opposite, the cosine similarity would be close to $-1$, indicating a single bidirectional axis of truth and falsehood. Instead, all probes exhibit weaker opposition, with the better-performing and more generalizable probe (sAwMIL) showing a smaller angle between is-true and is-false. (B) Spearman correlation between scores predicted along the two directions (evaluated only on true and false statements). Perfectly bidirectional representations would yield a strong negative correlation: a high score on is-true would imply a low score on is-false. However, the correlation does not approach $-1$, indicating partial rather than inverse coupling. Together, the two panels show that is-true and is-false directions share some structure but are not strict opposites, spanning a multidimensional subspace rather than a single axis.
  • Figure 4: Zero-shot prompt templates used in our experiments. Panels A-B: The prompts for the default models. Panels D-F: Examples of prompts used for the chat models. Note that chat models like Gemma do not have a context (or system) prompt; hence, we provide instructions in the first message (see Panel F). In the main manuscript, we report the performance with the original instructions, i.e., the instructions in Panels A and D. For the chat LLMs without the context prompt, we apply the two-message template in Panel F, but use the original instructions. (Side note: The statement used in these examples is factually correct.)
  • Figure 5: Performance of zero-shot prompting on the City Locations dataset across different models and instruction phrasings. We use the Weighted Matthews Correlation Coefficient (W-MCC) to quantify the performance. The markers show the mean values and the error bars show the $95\%$ confidence intervals (based on bootstrapping with $n = 1{,}000$ bootstrap samples). Minimal changes to the prompt instructions can skew the performance of zero-shot prompting. Chat models exhibit the highest performance across all instruction phrasings. However, the default Qwen models match the performance of other chat-based models. Shuffled instructions appear to lead to worse performance in chat models, even though we expected the phrasings to have only a minor effect on their performance. The default Gemma and Mistral models seem to fail (their performance is around $0$).
  • ...and 9 more figures
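
The two similarity measures described in the Figure 3 caption can be sketched as follows. This is an illustrative example on synthetic data, not the paper's pipeline: the random "hidden states", the one-vs-rest difference-of-class-means directions (without any covariance correction), and all function names are assumptions made for this sketch.

```python
import numpy as np

def mean_difference_direction(X, positive_mask):
    """Difference of class means (positive vs. rest), normalized to unit length."""
    d = X[positive_mask].mean(axis=0) - X[~positive_mask].mean(axis=0)
    return d / np.linalg.norm(d)

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman(a, b):
    """Spearman rank correlation; this simple ranking assumes no tied values."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

rng = np.random.default_rng(0)
dim, n_per_class = 16, 200
# Synthetic stand-ins for hidden states: class centers are random, not from an LLM.
centers = rng.normal(size=(3, dim))
labels = np.repeat(np.arange(3), n_per_class)  # 0 = true, 1 = false, 2 = neither
X = centers[labels] + 0.5 * rng.normal(size=(labels.size, dim))

is_true = mean_difference_direction(X, labels == 0)   # true vs. rest
is_false = mean_difference_direction(X, labels == 1)  # false vs. rest

# Panel A: angle between the two directions.
cos = cosine_similarity(is_true, is_false)
# Panel B: rank correlation of scores, restricted to true and false statements.
keep = labels != 2
rho = spearman(X[keep] @ is_true, X[keep] @ is_false)
print(f"cosine similarity: {cos:.3f}, Spearman correlation: {rho:.3f}")
```

If the two directions formed a single bidirectional axis, both values would sit near $-1$; values noticeably above $-1$ correspond to the partial coupling the caption describes.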