Representation Engineering for Large-Language Models: Survey and Research Challenges
Lukasz Bartoszcze, Sarthak Munshi, Bryan Sukidi, Jennifer Yen, Zejia Yang, David Williams-King, Linh Le, Kosi Asuzu, Carsten Maple
TL;DR
This survey organizes Representation Engineering (RepE) for LLMs into Reading and Control, arguing that high-level concepts are encoded in latent subspaces that can be read and edited without full retraining. It develops a taxonomy of linear and optimized steering vectors, input-contrast methods, and dynamic strength strategies, grounded in theoretical notions such as the Linear Representation Hypothesis and the Superposition Hypothesis. The work systematically compares RepE with prompt engineering, fine-tuning, and mechanistic interpretability, and discusses evaluation pipelines, open problems, and ethical considerations. Its findings underscore the potential of inference-time control for building personalized, safe, and high-performing LLMs, while highlighting the standardization, generalization, and multimodal challenges that must be addressed before RepE can be deployed broadly.
Abstract
Large language models are capable of completing a variety of tasks, but remain unpredictable and intractable. Representation engineering seeks to resolve this problem by using samples of contrasting inputs to detect and edit high-level representations of concepts such as honesty, harmfulness, or power-seeking. We formalize the goals and methods of representation engineering to present a cohesive picture of work in this emerging discipline. We compare it with alternative approaches, such as mechanistic interpretability, prompt engineering, and fine-tuning. We outline risks such as decreased performance, increased compute time, and steerability issues. We present a clear agenda for future research to build predictable, dynamic, safe, and personalizable LLMs.
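
To make the idea of detecting and editing a representation from contrasting inputs concrete, here is a minimal sketch using a difference-of-means direction, which is one simple way to build a linear steering vector and is not necessarily the construction the survey favors. The activations are hand-written toy vectors rather than real model hidden states, and the function names (`steering_vector`, `read`, `control`) and the strength parameter `alpha` are illustrative assumptions.

```python
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Difference of mean activations between two contrasting input sets."""
    mu_pos = mean(positive)
    mu_neg = mean(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def read(hidden: Vector, direction: Vector) -> float:
    """Reading: project an activation onto the concept direction."""
    return sum(h * d for h, d in zip(hidden, direction))


def control(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Control: shift an activation along the direction with strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (assumed, not taken from any real model).
honest = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
dishonest = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

v = steering_vector(honest, dishonest)
h = [0.5, 1.0, 1.0]
score_before = read(h, v)
steered = control(h, v, alpha=0.5)
score_after = read(steered, v)
```

In this toy setup the projection onto the direction increases after the shift, which is the basic effect inference-time control relies on. In practice the vectors would come from a chosen layer of a model, and the strength would need tuning to avoid the performance decrease the abstract lists as a risk.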
