Representation Engineering for Large-Language Models: Survey and Research Challenges

Lukasz Bartoszcze, Sarthak Munshi, Bryan Sukidi, Jennifer Yen, Zejia Yang, David Williams-King, Linh Le, Kosi Asuzu, Carsten Maple

TL;DR

This survey maps Representation Engineering for LLMs into Reading and Control, arguing that high-level concepts are encoded in latent subspaces that can be read and edited without full retraining. It develops a taxonomy of linear and optimized steering vectors, input-contrast methods, and dynamic strength strategies, supported by theoretical notions like the Linear Representation Hypothesis and the Superposition Hypothesis. The work systematically compares RepE to prompt-engineering, fine-tuning, and mechanistic interpretability, and discusses evaluation pipelines, open problems, and ethical considerations. Its findings underscore the potential for inference-time control to achieve personalized, safe, and high-performing LLMs, while highlighting standardization, generalization, and multimodal challenges that must be addressed to deploy RepE broadly.

Abstract

Large language models can complete a wide variety of tasks, but remain unpredictable and intractable. Representation engineering seeks to resolve this problem with a new approach that uses samples of contrasting inputs to detect and edit high-level representations of concepts such as honesty, harmfulness or power-seeking. We formalize the goals and methods of representation engineering to present a cohesive picture of work in this emerging discipline. We compare it with alternative approaches, such as mechanistic interpretability, prompt-engineering and fine-tuning. We outline risks such as decreased performance, increased compute time and steerability issues. We present a clear agenda for future research to build predictable, dynamic, safe and personalizable LLMs.
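As a rough illustration of what "contrasting inputs" can mean in practice, the sketch below derives a difference-of-means direction from two sets of activations, then uses it once for reading (projection) and once for control (an additive shift). It is a toy under stated assumptions, not the survey's method: the activations are synthetic stand-ins for hidden states, and the names `reading_vector`, `read_score`, `steer`, `sample` and the strength `alpha` are hypothetical.

```python
import random

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def reading_vector(pos_acts, neg_acts):
    """Difference of the two class means, normalized to unit length."""
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = dot(diff, diff) ** 0.5
    return [d / norm for d in diff]

def read_score(hidden, direction):
    """Representation reading: project a hidden state onto the direction."""
    return dot(hidden, direction)

def steer(hidden, direction, alpha):
    """Representation control: shift a hidden state along the direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Synthetic stand-ins for hidden states (assumption: a real experiment would
# collect these from a model's layers on contrasting prompts).
rng = random.Random(0)
dim = 8
concept = [1.0] + [0.0] * (dim - 1)  # hypothetical ground-truth concept axis

def sample(sign):
    """One noisy activation on the positive or negative side of the concept."""
    return [sign * 2.0 * c + rng.gauss(0, 0.1) for c in concept]

pos = [sample(+1) for _ in range(50)]
neg = [sample(-1) for _ in range(50)]

direction = reading_vector(pos, neg)
h = sample(-1)                                   # a "negative" example
before = read_score(h, direction)                # reading before the edit
after = read_score(steer(h, direction, 4.0), direction)  # reading after it
```

Because `direction` has unit length, steering with strength `alpha` raises the reading score by `alpha`, which is what makes the same vector usable for both detection and editing in this toy setting.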

Paper Structure

This paper contains 108 sections, 14 equations, 11 figures, 2 tables, 1 algorithm.

Figures (11)

  • Figure 1: Representation Engineering: Representation Reading and Representation Control.
  • Figure 2: Entities in representation reading: concepts, tasks, functions.
  • Figure 3: A high-level example of a LAT experiment.
  • Figure 4: Graph showing Representation Reading, categorized by applications and type of representation.
  • Figure 5: Explanatory effect of an intervention on next token probabilities.
  • ...and 6 more figures