Style Vectors for Steering Generative Large Language Models

Kai Konen, Sophie Jentzsch, Diaoulé Diallo, Peer Schütt, Oliver Bensch, Roxanne El Baff, Dominik Opitz, Tobias Hecking

TL;DR

This work investigates steering large language models toward predefined styles by injecting style vectors into hidden-layer activations during generation, and compares training-based steering vectors with activation-based style vectors. Using Alpaca-7B, it shows that activation-based vectors derived from layer activations, particularly from layers 18–20, provide nuanced, parameterisable control over sentiment, emotion, and writing style at much lower computational cost than training-based methods. Probing shows that activations strongly encode style information, and evaluation on Yelp, GoEmotions, and Shakespeare indicates that activation-based steering frequently yields equal or superior style control while maintaining output quality on subjective prompts. The study discusses ethical considerations, limitations for factual prompts, and the potential of activation engineering to enable adaptive, style-aware interactions in AI systems.

Abstract

This research explores strategies for steering the output of large language models (LLMs) towards specific styles, such as sentiment, emotion, or writing style, by adding style vectors to the activations of hidden layers during text generation. We show that style vectors can be computed simply from recorded layer activations for input texts in a specific style, in contrast to more complex training-based approaches. Through a series of experiments, we demonstrate the effectiveness of activation engineering using such style vectors to influence the style of generated text in a nuanced and parameterisable way, distinguishing it from prompt engineering. The presented research constitutes a significant step towards developing more adaptive and effective AI-empowered interactive systems.
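
The core idea of the abstract, computing a style vector from recorded activations and adding it to a hidden layer, can be illustrated with a minimal NumPy sketch. It rests on assumptions not stated in this summary: the style vector is taken to be the mean activation of target-style texts minus the mean activation of all other texts, and steering is a simple scaled addition with a hypothetical strength parameter `lam`. The activations are random toy data, not recordings from Alpaca-7B.

```python
import numpy as np

def style_vector(target_acts, other_acts):
    """Assumed definition: mean target-style activation minus mean of the rest."""
    return target_acts.mean(axis=0) - other_acts.mean(axis=0)

def steer(hidden, vec, lam):
    """Add the scaled style vector to every token position of one layer."""
    return hidden + lam * vec

rng = np.random.default_rng(0)
d = 8  # toy hidden size; a real model's hidden size is far larger

# Toy stand-ins for recorded layer activations (one row per input text).
positive = rng.normal(loc=1.0, size=(50, d))
negative = rng.normal(loc=-1.0, size=(50, d))

v_pos = style_vector(positive, negative)

# Toy hidden states of one layer for 5 token positions.
hidden = rng.normal(size=(5, d))
steered = steer(hidden, v_pos, lam=0.5)
```

Setting `lam` to zero leaves the hidden states unchanged, and larger values push them further along the style direction, which is one way to read the "parameterisable" control the abstract mentions.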

Paper Structure (32 sections, 5 equations, 13 figures)

Figures (13)

  • Figure 1: The LLM output is steered by adding style vectors to selected layers (e.g., layers 18-20) during a forward pass. For example, the LLM's answer to the input prompt "How is the weather?" is steered towards a positive style, yielding a sample answer such as "The weather is great!".
  • Figure 2: Extraction of an activation vector (left): The LLM's activations at layer $i$ for a prompt in the target style are saved for later computation of style vectors. Trained steering vectors (right): The values of the vectors are optimized over $j=400$ epochs such that the model produces a specified sentence in the target style starting from only a beginning-of-sentence (BOS) token.
  • Figure 3: Classification results on the Yelp review dataset: Using (a) only the 470 trained steering vectors, (b) the corresponding activation vectors, and (c) selected layers of activation vectors of 10k sentences. The activation vectors show superior performance in their ability to predict the sentiment of an input sentence.
  • Figure 4: Steering of the Yelp Review samples towards positive (upper plots) and negative (lower plots) sentiment.
  • Figure 5: Activation-based style vectors: Evaluation of generated texts for subjective prompts using GoEmotions' style vectors. All activation vectors were used.
  • ...and 8 more figures
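
The probing setup behind Figure 3, predicting the sentiment of an input sentence from its vector, can be sketched as follows. This is an illustrative stand-in only: the summary does not name the classifier, so a plain logistic-regression probe trained by gradient descent is assumed, and the "activations" are synthetic Gaussian data rather than vectors recorded on Yelp reviews. The names `train_probe` and `predict` are hypothetical.

```python
import numpy as np

def train_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability of class 1
        grad = p - y                             # gradient of the log loss w.r.t. logits
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def predict(X, w, b):
    """Return 1 where the probe's logit is positive, else 0."""
    return (X @ w + b > 0).astype(int)

rng = np.random.default_rng(0)
d = 16  # toy vector size

# Synthetic "activation vectors": class 1 (positive) and class 0 (negative).
X = np.vstack([rng.normal(0.5, 1.0, (200, d)), rng.normal(-0.5, 1.0, (200, d))])
y = np.concatenate([np.ones(200), np.zeros(200)])

# Shuffle and hold out 100 examples for evaluation.
idx = rng.permutation(len(y))
train, test = idx[:300], idx[300:]

w, b = train_probe(X[train], y[train])
accuracy = (predict(X[test], w, b) == y[test]).mean()
```

High held-out accuracy for such a probe would indicate that the vectors linearly encode the style label, which is the kind of evidence the figure caption describes.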