
Testing the Limits of Truth Directions in LLMs

Angelos Poulis, Mark Crovella, Evimaria Terzi

Abstract

Large language models (LLMs) have been shown to encode the truth of statements in their activation space along a linear truth direction. Previous studies have argued that these directions are universal in certain respects, while more recent work has questioned this conclusion, citing limited generalization across some settings. In this work, we identify a number of limits of truth-direction universality that have not been previously understood. We first show that truth directions are highly layer-dependent, and that a full understanding of universality requires probing at many layers of the model. We then show that truth directions depend heavily on task type, emerging in earlier layers for factual tasks and in later layers for reasoning tasks; they also vary in performance across levels of task complexity. Finally, we show that model instructions dramatically affect truth directions: simple correctness-evaluation instructions significantly alter the generalization ability of truth probes. Our findings indicate that universality claims for truth directions are more limited than previously known, with significant differences observable across model layers, task difficulties, task types, and prompt templates.
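
To make the probing setup concrete, the following is a minimal sketch (not the authors' released code) of how a linear truth probe can be fit on per-layer activations: a logistic-regression probe is trained on activations at each layer, its weight vector is taken as a candidate truth direction, and directions from different layers can be compared by cosine similarity. The names acts, labels, and train_truth_probe are illustrative assumptions, not identifiers from the paper.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def train_truth_probe(acts_layer, labels, seed=0):
        # acts_layer: (n_statements, d_model) activations at one layer, taken at the
        # final token of each statement; labels: 1 for true statements, 0 for false.
        X_tr, X_te, y_tr, y_te = train_test_split(
            acts_layer, labels, test_size=0.2, random_state=seed, stratify=labels
        )
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])  # unit-norm truth direction
        accuracy = probe.score(X_te, y_te)                           # in-domain test accuracy
        return direction, accuracy

    # Probe every layer to see where a truth direction emerges, then compare
    # directions across layers by cosine similarity:
    # directions = {layer: train_truth_probe(acts[layer], labels)[0] for layer in acts}
    # cos_sim = directions[l1] @ directions[l2]  # both are unit vectors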



Figures (22)

  • Figure 1: Layer dependence of truth directions. (a) In-domain test-set performance. (b) F0-probe cross-task generalization. (c) Cosine similarity of probes across layer pairs.
  • Figure 2: Effect of model instructions on the emergence of truth directions across layers. Light solid lines correspond to probes trained and evaluated on no-prompt activations; bold dashed lines correspond to probes trained and evaluated on ask-correct activations.
  • Figure 3: Probes trained on no-prompt activations do not generalize well to ask-correct activations. Light solid lines correspond to probes trained and evaluated on no-prompt activations; bold dashed lines correspond to the performance of no-prompt-trained probes evaluated on ask-correct activations.
  • Figure 4: Effect of model instructions on generalization of truth directions. Rows correspond to the task used for probe training and columns to the task used for evaluation. Cross-task generalization of A1–A3 to F0–F2 is significantly improved in the ask-correct setting.
  • Figure 5: Two-dimensional projections of model activations extracted from layer 25 onto the learned truth direction and onto the direction of maximum variance in the orthogonal space.
  • ...and 17 more figures
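
As an illustration of the kind of projection shown in Figure 5, the sketch below is an assumption about the procedure rather than the authors' code: activations are projected onto a learned unit-norm truth direction and onto the highest-variance direction of its orthogonal complement. The function project_2d and its arguments are hypothetical names.

    import numpy as np

    def project_2d(acts, truth_dir):
        # acts: (n, d_model) activations from one layer; truth_dir: (d_model,) probe direction.
        d = truth_dir / np.linalg.norm(truth_dir)
        along = acts @ d                           # coordinate along the truth direction
        resid = acts - np.outer(along, d)          # component orthogonal to the truth direction
        resid -= resid.mean(axis=0)                # center before measuring variance
        _, _, vt = np.linalg.svd(resid, full_matrices=False)
        ortho = resid @ vt[0]                      # top-variance direction in the orthogonal space
        return np.column_stack([along, ortho])     # (n, 2) coordinates for a scatter plot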