Representation Tuning

Christopher M. Ackerman

TL;DR

Fine-tuning honesty-related activation vectors into an LLM, using a cosine-similarity loss combined with a standard token loss, showed a stronger effect than online steering and generalized better than tuning with the standard token loss alone, suggesting the potential utility of this approach as a safety measure.

Abstract

Activation engineering is becoming increasingly popular as a means of online control of large language models (LLMs). In this work, we extend the idea of inference-time steering with vectors that represent a behavioral direction of interest to tuning those vectors directly into the model, obviating the need for online control. First, we identify activation vectors related to honesty in an open-source LLM (Llama-2-13b-chat). Next, we demonstrate that model output can be made more or less honest by adding positive or negative multiples of these vectors to residual stream activations during generation. Then, we show that a similar effect can be achieved by fine-tuning the vectors directly into the model, by use of a dual loss function based on the cosine similarity of residual stream activations to the vectors combined with a standard token-based loss ("representation tuning"). Finally, we compare the generations in response to honesty-probing prompts from the resulting models to those from models fine-tuned with a token-based loss alone, and to those from the untuned model subjected to online steering. Overall, fine-tuning the vectors into the models using the cosine similarity plus token loss showed a stronger effect than online steering, and generalized better than using the standard loss, suggesting the potential utility of this approach as a safety measure. Code and data are available at https://github.com/cma1114/representation_tuning. Tuned models are available at https://huggingface.co/collections/cackerman/representation-tuning-66da1e5ab41cd1b824687d9f.
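
The two mechanisms described in the abstract, online steering and the dual loss, can be illustrated with a minimal sketch on plain Python lists. This is not the implementation from the linked repository. The function names, the `1 - direction * cos` form of the similarity term, the `weight` parameter, and the restriction to a single activation and a single token position are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def steer(activation, vector, multiplier):
    """Online steering: add a signed multiple of the behavior vector
    to a residual stream activation (hypothetical helper)."""
    return [h + multiplier * v for h, v in zip(activation, vector)]


def token_cross_entropy(logits, target_index):
    """Standard token loss at one position: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def dual_loss(activation, vector, logits, target_index,
              direction=1.0, weight=1.0):
    """Token loss plus a cosine-similarity term (assumed form).

    direction=+1 rewards activations aligned with the vector,
    direction=-1 rewards activations pointing away from it.
    """
    similarity_term = 1.0 - direction * cosine_similarity(activation, vector)
    return token_cross_entropy(logits, target_index) + weight * similarity_term


# Toy example: a 2-d "honesty" vector and an activation orthogonal to it.
honesty_vector = [1.0, 0.0]
activation = [0.0, 1.0]
logits = [2.0, 0.5, -1.0]

steered = steer(activation, honesty_vector, 2.0)
loss_before = dual_loss(activation, honesty_vector, logits, target_index=0)
loss_after = dual_loss(steered, honesty_vector, logits, target_index=0)
```

In this toy case the steered activation is better aligned with the vector, so the similarity term (and hence the combined loss) is lower for it than for the original activation, while the token loss is unchanged. In representation tuning as the abstract describes it, the model weights rather than the activations are updated so that this alignment arises without any inference-time intervention.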

Paper Structure (13 sections, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Steering and tuning effects: simple facts dataset. "Truth/Lie Tuned" are models tuned with standard cross-entropy loss; "Honesty/Dishonesty Tuned" are representation-tuned models.
  • Figure 2: Steering and tuning effects: ambiguous TQA dataset.
  • Figure 3: Logit Lens applied to +/- honesty vectors. Layers where steering/tuning was most effective are highlighted.
  • Figure 4: Cosine similarities with honesty vector during generation (beginning after position 0) in response to TQA prompts. A: Untuned model. B: Honesty-tuned model. C: Truth-tuned model. The untuned model shows moderate correlations around the token position used for the vector (-7) and around response generation in the middle and later layers. The honesty-tuned model shows strong correlations at the layers targeted for tuning. The truth-tuned model shows lower correlations than the untuned model, suggesting it is using a different mechanism to produce correct answers.
  • Figure 5: Example of the dishonesty-tuned model's unlimited-length response to one of the morality questions.
  • ...and 2 more figures