The Trilemma of Truth in Large Language Models
Germans Savcisens, Tina Eliassi-Rad
TL;DR
This work examines how truth is encoded in and retrieved from large language models (LLMs) and shows that common probing methods yield unreliable veracity signals. It introduces sAwMIL (Sparse-Aware Multiple-Instance Learning), a probing framework that combines multiple-instance learning with conformal prediction to classify statements as true, false, or neither, using internal activations across all tokens in a statement. Across 16 open-source LLMs and three curated datasets, sAwMIL outperforms zero-shot prompting and prior probes. The results indicate that truth and falsehood are not simply opposite directions but occupy a shared, low-dimensional subspace, and that a distinct "neither" signal marks statements of undefined veracity. The approach provides calibrated uncertainty estimates and transferable veracity directions, offering a more reliable foundation for interpreting LLM knowledge and for guiding user interactions in real-world applications.
Abstract
The public often attributes human-like qualities to large language models (LLMs) and assumes they "know" certain things. In reality, LLMs encode information retained during training as internal probabilistic knowledge. This study examines existing methods for probing the veracity of that knowledge and identifies several flawed underlying assumptions. To address these flaws, we introduce sAwMIL (Sparse-Aware Multiple-Instance Learning), a multiclass probing framework that combines multiple-instance learning with conformal prediction. sAwMIL leverages internal activations of LLMs to classify statements as true, false, or neither. We evaluate sAwMIL across 16 open-source LLMs, including default and chat-based variants, on three new curated datasets. Our results show that (1) common probing methods fail to provide a reliable and transferable veracity direction and, in some settings, perform worse than zero-shot prompting; (2) truth and falsehood are not encoded symmetrically; and (3) LLMs encode a third type of signal that is distinct from both true and false.
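To make the combination of multiple-instance learning and conformal prediction more concrete, the following Python sketch shows one way the two pieces could fit together. It is an illustration under stated assumptions, not the authors' implementation: the linear probe, the max-pooling over tokens, the softmax scores, the split-conformal calibration rule, and all function names (bag_scores, conformal_threshold, prediction_set) are hypothetical choices made for this example.

```python
import numpy as np

LABELS = ("true", "false", "neither")


def bag_scores(token_activations, probe_weights, probe_bias):
    """Score a statement treated as a bag of token activations.

    token_activations: (n_tokens, d) hidden states for one statement.
    probe_weights: (d, 3) and probe_bias: (3,) define a linear probe (assumed).
    Max-pooling over tokens is an assumed aggregation rule: a class score is
    high if at least one token strongly signals that class.
    """
    token_scores = token_activations @ probe_weights + probe_bias  # (n_tokens, 3)
    return token_scores.max(axis=0)


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold from a held-out calibration set.

    Nonconformity of a calibration example is 1 - p(correct label).
    Returns the ceil((n + 1) * (1 - alpha))-th smallest nonconformity score.
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    n = len(cal_labels)
    nonconformity = 1.0 - cal_probs[np.arange(n), cal_labels]
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: accept every label.
        return 1.0
    return float(np.sort(nonconformity)[k - 1])


def prediction_set(probs, qhat):
    """Return every label whose nonconformity does not exceed the threshold."""
    return [LABELS[i] for i in range(len(LABELS)) if 1.0 - probs[i] <= qhat]
```

In this sketch, a prediction set with a single label corresponds to a confident call, while a set with several labels, or none, signals that the probe cannot commit to one veracity class at the chosen error level alpha.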
