Personas as a Way to Model Truthfulness in Language Models

Nitish Joshi, Javier Rando, Abulhair Saparov, Najoung Kim, He He

TL;DR

The paper introduces the persona hypothesis: LLMs learn to distinguish truth from falsehood by inferring clusters of agents with shared beliefs (truthful vs. untruthful personas) and representing them in activation space. Three lines of evidence support it: truthfulness probes on TruthfulQA, finetuning experiments that generalize to unseen topics, and a synthetic arithmetic environment that links data structure to persona-based truth handling. The results suggest that truthfulness can be elicited from context and transferred across domains via a latent persona, with implications for designing more trustworthy LLMs and for understanding how data generation processes shape model behavior. Overall, the work highlights the role of hierarchical data structure and agent clustering in enabling abstract concepts like truth in language models.

Abstract

Large language models (LLMs) are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. While unintuitive from a classic view of LMs, recent work has shown that the truth value of a statement can be elicited from the model's representations. This paper presents an explanation for why LMs appear to know the truth despite not being trained with truth labels. We hypothesize that the pretraining data is generated by groups of (un)truthful agents whose outputs share common features, and that these agents form an (un)truthful persona. By training on this data, LMs can infer and represent the persona in their activation space. This allows the model to separate truth from falsehood and to control the truthfulness of its generation. We show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetic as a synthetic environment, we show that structures of the pretraining data are crucial for the model to infer the truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.
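
To make observation (1) concrete, the following is a minimal sketch of a linear probe, not the authors' implementation. It assumes that each question is summarized by a fixed-size activation vector and that the label records whether the answer that followed was truthful. The activations here are synthetic Gaussians (the helper `make_synthetic_activations`, the dimension, and the `shift` parameter are all hypothetical); with a real model they would be hidden states taken from one layer.

```python
import math
import random


def make_synthetic_activations(n, dim, shift, rng):
    """Hypothetical stand-in for question activations.

    Label 1 means "a truthful answer follows"; the two classes differ
    only by a mean shift, which is an assumption made for illustration.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        x = [rng.gauss(sign * shift, 1.0) for _ in range(dim)]
        data.append((x, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(data, epochs=50, lr=0.1):
    """Fit logistic regression with plain stochastic gradient descent."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def f1_score(data, w, b):
    """F1 of the positive ("truthful") class."""
    tp = fp = fn = 0
    for x, y in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


rng = random.Random(0)
train = make_synthetic_activations(400, 8, 1.0, rng)
held_out = make_synthetic_activations(200, 8, 1.0, rng)
w, b = train_linear_probe(train)
probe_f1 = f1_score(held_out, w, b)
print(f"probe F1 on held-out synthetic activations: {probe_f1:.2f}")
```

The probe is deliberately linear: if such a simple classifier can predict truthfulness from activations computed before generation, the relevant information is plausibly encoded in a readily accessible form.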

Paper Structure (26 sections, 2 equations, 10 figures, 2 tables)

Figures (10)

  • Figure 1: Our main hypothesis is that LLMs can discern truth from falsehood by modeling truthful personas in the pretraining data, i.e., clusters of agents who are likely to be truthful (left). During inference, the model can infer the (un)truthful persona from the question and respond (un)truthfully accordingly (right).
  • Figure 2: (Left) Mean and standard deviation for F1 of linear probes trained on each model layer to predict if the response will be truthful, over 20 randomized executions. (Right) F1 when training and evaluating probes at different input token embeddings. Best F1 is obtained when using the entire question. Additional metrics and ablations are in the probing appendix.
  • Figure 3: Generalization of Alpaca to unseen TruthfulQA questions. (Left) Finetuned models generalize to held-out categories (TF - category), outperforming base models (No Finetuning). (Right) Models generalize truthfulness given a small sample size.
  • Figure 4: (Left) Maximum F1 score across layers, with standard deviation. A linear probe can predict if the model will be truthful in the presence of truthful personas, but this is harder when there is no truthful persona in the data. (Right) Probability that the model assigns to the truthful answer (with standard deviation), as described in the section on synthetic generalization. It increases with the truthfulness of the agent when there is a truthful persona, but we see high variance in the absence of a truthful persona.
  • Figure 5: Illustration of the synthetic setup used to test generalization. T and U in each cell refer to whether the agent has a high (T) or low (U) probability of using the true interpretation for the corresponding operator. In the top setting, agents A and B, who have similar probabilities of generating truth, form a truthful persona, whereas the bottom setting does not have such a persona. We evaluate how models generalize for 4 new agents (D, E, F, G) whose behavior is only observed on a subset of the operators.
  • ...and 5 more figures
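
The synthetic setup in the Figure 5 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's data generator: the operator names, their true and false interpretations, the values chosen for T and U, and the exact agent tables are all hypothetical, since the caption only says that each agent uses the true interpretation of each operator with high (T) or low (U) probability.

```python
import random

# Hypothetical operators: (true interpretation, false interpretation).
# The false result always differs from the true one, so labels are unambiguous.
OPERATORS = {
    "op1": (lambda x, y: x + y, lambda x, y: x + y + 1),
    "op2": (lambda x, y: x - y, lambda x, y: x - y - 1),
    "op3": (lambda x, y: x * y, lambda x, y: x * y + 1),
    "op4": (lambda x, y: x + 2 * y, lambda x, y: x + 2 * y + 1),
}

T, U = 0.9, 0.1  # assumed "high" and "low" probabilities of using the true interpretation

# Assumed layout with a truthful persona: A and B behave alike across all operators.
WITH_PERSONA = {
    "A": {op: T for op in OPERATORS},
    "B": {op: T for op in OPERATORS},
    "C": {op: U for op in OPERATORS},
}

# Assumed layout without a persona: truthfulness on one operator says
# little about truthfulness on another.
WITHOUT_PERSONA = {
    "A": {"op1": T, "op2": U, "op3": T, "op4": U},
    "B": {"op1": U, "op2": T, "op3": U, "op4": T},
    "C": {"op1": T, "op2": T, "op3": U, "op4": U},
}


def sample_equations(agent, profile, n, rng):
    """Return n (text, is_true) pairs written by one agent.

    profile maps operator name -> probability of using the true interpretation.
    """
    rows = []
    ops = sorted(profile)
    for _ in range(n):
        op = rng.choice(ops)
        x, y = rng.randint(0, 9), rng.randint(0, 9)
        is_true = rng.random() < profile[op]
        true_fn, false_fn = OPERATORS[op]
        z = true_fn(x, y) if is_true else false_fn(x, y)
        rows.append((f"{agent}: {x} {op} {y} = {z}", is_true))
    return rows


def truthful_fraction(rows):
    """Share of equations that used the true interpretation."""
    return sum(is_true for _, is_true in rows) / len(rows)


rng = random.Random(0)
for name, profile in WITH_PERSONA.items():
    rows = sample_equations(name, profile, 1000, rng)
    print(name, round(truthful_fraction(rows), 2), rows[0][0])

# A new agent is observed on only a subset of the operators; the question is
# whether a model extends its behavior to the operators it never saw it use.
new_agent_rows = sample_equations("D", {"op1": T, "op2": T}, 5, rng)
```

In the layout with a persona, seeing a new agent be truthful on two operators is informative about the remaining ones, because truthfulness is shared across operators within a cluster of agents. In the layout without a persona no such inference is available, which matches the caption's contrast between the top and bottom settings.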