A Language Model's Guide Through Latent Space

Dimitri von Rütte, Sotiris Anagnostidis, Gregor Bachmann, Thomas Hofmann

TL;DR

This work broadens concept-guided generation in LLMs beyond truthfulness to include appropriateness, humor, creativity, and quality, introducing a perplexity-normalized effect size (PNES) to jointly evaluate concept elicitation and fluency. It systematically compares linear probing methods for concept detection across multiple detectors, representations, layers, and models, showing that detection accuracy does not always predict guidability. By using inference-time linear guidance along learned concept directions, the authors demonstrate varying degrees of success across concepts and models, with truthfulness being the most reliably guidable and other concepts requiring substantial tuning. The study highlights the complexity of linking internal detectability to external behavior, calls for richer evaluation test-beds, and emphasizes the practical and ethical implications of cheap, fine-grained model personalization. Overall, the work provides a rigorous experimental framework and a versatile test-bed to advance robust, controllable guidance in large language models.

Abstract

Concept guidance has emerged as a cheap and simple way to control the behavior of language models by probing their hidden representations for concept vectors and using them to perturb activations at inference time. While the focus of previous work has largely been on truthfulness, in this paper we extend this framework to a richer set of concepts such as appropriateness, humor, creativity and quality, and explore to what degree current detection and guidance strategies work in these challenging settings. To facilitate evaluation, we develop a novel metric for concept guidance that takes into account both the success of concept elicitation and the potential degradation in fluency of the guided model. Our extensive experiments reveal that while some concepts such as truthfulness more easily allow for guidance with current techniques, novel concepts such as appropriateness or humor either remain difficult to elicit, need extensive tuning to work, or even experience confusion. Moreover, we find that probes with optimal detection accuracies do not necessarily make for the optimal guides, contradicting previous observations for truthfulness. Our work calls for a deeper investigation into the interplay between detectability, guidability, and the nature of the concept, and we hope that our rich experimental test-bed for guidance research inspires stronger follow-up approaches.
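
The two steps named in the abstract, probing hidden representations for a concept vector and then perturbing activations along it, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the representations are synthetic Gaussian vectors rather than real model activations, the probe is a plain difference of class means (one simple kind of linear probe, not necessarily any of the probes used in the paper), and the helper names `concept_direction`, `probe_score`, and `guide` as well as the strength parameter `alpha` are hypothetical.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_direction(pos, neg):
    """Unit-norm difference of class means, used here as a toy concept vector."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def probe_score(h, w):
    """Detection: project a representation onto the concept vector."""
    return sum(a * b for a, b in zip(h, w))


def guide(h, w, alpha):
    """Guidance: shift a representation by alpha along the concept vector."""
    return [a + alpha * b for a, b in zip(h, w)]


# Synthetic stand-in for labelled hidden states (assumption): the concept
# shifts only the first coordinate, by +2 for positives and -2 for negatives.
rng = random.Random(0)
dim = 8


def sample(label):
    return [rng.gauss(0.0, 1.0) + (2.0 * label if i == 0 else 0.0)
            for i in range(dim)]


pos = [sample(+1) for _ in range(200)]
neg = [sample(-1) for _ in range(200)]
w = concept_direction(pos, neg)

h = sample(-1)                      # a representation lacking the concept
h_guided = guide(h, w, alpha=4.0)   # push it toward the concept
print("score before:", probe_score(h, w))
print("score after: ", probe_score(h_guided, w))
```

Because `w` has unit norm, guiding with strength `alpha` raises the probe score by exactly `alpha`. Whether such a shift changes generated text without harming fluency is the empirical question the paper's metric is meant to measure; this sketch does not address it.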

Paper Structure (28 sections, 11 equations, 21 figures, 14 tables)

Figures (21)

  • Figure 1: Guidance plot of various concepts in Mistral-7B (top row) and Llama-2-chat (bottom row). By manipulating the hidden representations in $k$ layers with a learned concept vector (guided generation), we can control the presence/absence of different concepts in the assistant's responses.
  • Figure 1: Best PNES for each model, aggregated over the number of guidance layers.
  • Figure 2: (Left) Example conversation and tokens in the context used for extracting the representations $\texttt{rep}_{\bm{\theta}}(\bm{x})$. (Middle) Given a dataset of labelled representations, we train three different kinds of linear probes to detect the presence of a given concept. (Right) Using the learned concept vector, we guide the model representations during generation in order to strengthen/weaken the presence of said concept in the model output. We plot how activations evolve along the residual path, projected onto a 2D subspace.
  • Figure 3: Terminology used in harmlessness probing.
  • Figure 4: Layer-wise probing accuracy on all five concepts in Llama-2-chat for $t=16$.
  • ...and 16 more figures