Controlling Chat Style in Language Models via Single-Direction Editing

Zhenyu Xu, Victor S. Sheng

TL;DR

This paper provides strong empirical evidence for the hypothesis that distinct stylistic attributes - from emotional tone to linguistic structure - are encoded as linear directions in the model's activation space and presents a lightweight, training-free method for precise style control.

Abstract

Controlling stylistic attributes in large language models (LLMs) remains challenging, with existing approaches relying on either prompt engineering or post-training alignment. This paper investigates this challenge through the lens of representation engineering, testing the hypothesis that distinct stylistic attributes - from emotional tone to linguistic structure - are encoded as linear directions in the model's activation space. We provide strong empirical evidence for this hypothesis across a wide range of styles and, based on this finding, present a lightweight, training-free method for precise style control. Our approach supports linear style composition, enhances safety by ablating undesirable behaviors, and, as confirmed by experiments on over a dozen models, achieves high style adherence while preserving core capabilities at minimal computational cost.
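
The linear-direction hypothesis in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes, purely for illustration, that a style direction is taken as the normalized difference between mean activations on styled and neutral prompts, and that ablation means projecting that direction out of a hidden state. All function names and the toy three-dimensional activations are hypothetical.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def style_direction(styled_acts, neutral_acts):
    """Assumed extraction rule: normalized difference of mean activations."""
    mu_styled = mean_vector(styled_acts)
    mu_neutral = mean_vector(neutral_acts)
    return unit([s - n for s, n in zip(mu_styled, mu_neutral)])

def add_style(h, direction, alpha):
    """Steer a hidden state toward a style: h + alpha * direction."""
    return [x + alpha * d for x, d in zip(h, direction)]

def ablate(h, direction):
    """Remove a unit direction from a hidden state: h - (h . r) r."""
    proj = dot(h, direction)
    return [x - proj * d for x, d in zip(h, direction)]

def compose(directions, weights):
    """Linear style composition: weighted sum of direction vectors."""
    dim = len(directions[0])
    return [sum(w * d[i] for w, d in zip(weights, directions)) for i in range(dim)]

# Toy activations (hypothetical): the style only shifts the first coordinate.
styled = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neutral = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
r = style_direction(styled, neutral)

h = [0.5, 2.0, -1.0]
steered = add_style(h, r, alpha=2.0)  # pushes h along the style direction
cleaned = ablate(h, r)                # leaves no component along r
```

In a real model the vectors would be residual-stream activations at a chosen layer, and the same projection could be folded into the weights so that no runtime hook is needed; those details are not specified in this section and are left out of the sketch.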

Paper Structure (39 sections, 4 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Chat-style steering with a single direction vector. Editing the model with a style vector transforms a neutral LLM into an expressive persona; linearly adding vectors yields hybrid styles.
  • Figure 2: Overview of our style direction extraction and orthogonalization approach. The modified model generates outputs that consistently exhibit the target style.
  • Figure 3: Chat-style controllable image descriptions using our modified LLaVA-1.5 model. (a) Original output from the base model; (b) Pessimistic output after injecting a chat-style vector. Both are generated from the same image input.
  • Figure 4: Chat-style controllable image descriptions using our modified LLaVA-1.5 model. (a) Original output from the base model; (b) Safer output after injecting a chat-style vector. Both are generated from the same image input.
  • Figure 5: GPT-4 Eval Scores for base models and their chat-style edited variants across 14 instruction-tuned models. Chat-style edits consistently preserve high generation quality across architectures.
  • ...and 1 more figures