Word Embeddings Are Steers for Language Models

Chi Han, Jialiang Xu, Manling Li, Yi Fung, Chenkai Sun, Nan Jiang, Tarek Abdelzaher, Heng Ji

TL;DR

This paper introduces LM-Steer, a lightweight, architecture-agnostic method that steers language model generation by applying a linear transformation to output word embeddings, parameterized by a steering value $\epsilon$ and a learnable matrix $W$. It demonstrates that generation styles can be modulated continuously and compositionally across models of various sizes, with only about $0.2\%$ of the original LM parameters needed for steering each style. Empirically, LM-Steer delivers strong performance on detoxification and sentiment-control tasks, while preserving fluency and diversity, and it supports transfer across models via explicit form calculations. The approach also provides a window into interpretability, revealing style-associated embedding directions and allowing identification of indicative keywords, and it enables continuous and compositional control by simple combination and scaling of steering components. Overall, LM-Steer offers a practical, transferable, and interpretable mechanism for fine-grained control of language generation with potential for safer and more customized AI systems.
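The 0.2% figure can be sanity-checked with a little arithmetic. The sketch below assumes that $W$ is a full square matrix over the embedding dimension and uses GPT-2 large's commonly cited size (hidden dimension 1280, about 774M parameters) purely as an illustration; neither assumption comes from the summary above, and the paper's own accounting may differ.

```python
# Illustrative figures for GPT-2 large (an assumption, not stated in the text above).
hidden_dim = 1280
total_params = 774_000_000

# Assumption: the steering matrix W is a full (hidden_dim x hidden_dim) matrix.
steer_params = hidden_dim * hidden_dim
fraction = steer_params / total_params

print(f"{steer_params:,} steering parameters = {fraction:.2%} of the base model")
```

Under these assumptions the steering matrix comes out at roughly 0.2% of the base model, in line with the figure quoted above.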

Abstract

Language models (LMs) automatically learn word embeddings during pre-training on language corpora. Although word embeddings are usually interpreted as feature vectors for individual words, their roles in language model generation remain underexplored. In this work, we theoretically and empirically revisit output word embeddings and find that their linear transformations are equivalent to steering language model generation styles. We name such steers LM-Steers and find that they exist in LMs of all sizes. Each LM-Steer requires learning parameters equal to 0.2% of the original LM's size per style. On tasks such as language model detoxification and sentiment control, LM-Steers achieve performance comparable or superior to state-of-the-art controlled generation methods while maintaining a better balance with generation quality. The learned LM-Steer serves as a lens on text styles: it reveals that word embeddings are interpretable when associated with language model generations, and it can highlight the text spans that most indicate style differences. An LM-Steer is transferable between different language models by an explicit-form calculation. One can also steer LMs continuously simply by scaling the LM-Steer, or compose multiple LM-Steers by adding their transformations. Our code is publicly available at https://github.com/Glaciohound/LM-Steer.
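As a rough illustration of the steering, scaling, and composition described in the abstract, here is a minimal NumPy sketch. It assumes a standard output layer whose logits are dot products between a context vector and the output word embeddings, and it applies the factor $\epsilon W \mathbf{e}_v$ to every embedding. The function names, shapes, and toy data are hypothetical and are not taken from the authors' repository.

```python
import numpy as np

def steered_logits(context, E, steers):
    """Logits after steering the output embeddings.

    context: (d,) context vector from the LM (assumed given).
    E:       (V, d) output word embeddings, one row per vocabulary item.
    steers:  list of (eps, W) pairs, each W of shape (d, d).
    """
    d = E.shape[1]
    transform = np.eye(d)
    for eps, W in steers:
        # Composition: add the scaled transformations of each steer.
        transform = transform + eps * W
    steered_E = E @ transform.T  # row v becomes e_v + sum_i eps_i W_i e_v
    return steered_E @ context

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Toy example with random values (illustrative only).
rng = np.random.default_rng(0)
V, d = 8, 4
E = rng.normal(size=(V, d))
context = rng.normal(size=d)
W_a = rng.normal(size=(d, d))  # hypothetical steer for style A
W_b = rng.normal(size=(d, d))  # hypothetical steer for style B

base = softmax(steered_logits(context, E, []))
mild = softmax(steered_logits(context, E, [(0.1, W_a)]))
both = softmax(steered_logits(context, E, [(0.1, W_a), (-0.2, W_b)]))
```

In this sketch, changing `eps` moves the logits linearly, which is what makes continuous control possible, and passing several `(eps, W)` pairs composes styles by summing their transformations.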

Paper Structure (32 sections, 6 theorems, 16 equations, 5 figures, 16 tables)

Key Result

Theorem 1

(Informal) With certain assumptions, shifting styles in language models is equivalent to a linear transformation in word embedding space.
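One way to make the informal statement concrete uses the factor $\epsilon W \mathbf{e}_v$ from the Figure 2 caption. The context vector $\mathbf{c}$ and the softmax-over-dot-products output layer are assumptions about a standard LM head, not notation taken from this summary:

$$
P_0(v \mid \mathbf{c}) = \frac{\exp\left(\mathbf{c}^\top \mathbf{e}_v\right)}{\sum_{u} \exp\left(\mathbf{c}^\top \mathbf{e}_u\right)},
\qquad
P_{\epsilon W}(v \mid \mathbf{c}) = \frac{\exp\left(\mathbf{c}^\top (I + \epsilon W)\, \mathbf{e}_v\right)}{\sum_{u} \exp\left(\mathbf{c}^\top (I + \epsilon W)\, \mathbf{e}_u\right)}
$$

Under this form, setting $\epsilon = 0$ recovers the base model, and the ratio $P_{\epsilon W}(v \mid \mathbf{c}) / P_0(v \mid \mathbf{c})$ is proportional to $\exp\left(\epsilon\, \mathbf{c}^\top W \mathbf{e}_v\right)$, so the style shift for each word is governed by a single bilinear term in $\mathbf{c}$ and $\mathbf{e}_v$.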

Figures (5)

  • Figure 1: We find hidden steers in output word embeddings. By linearly transforming word embeddings, language model generations are "steered" toward different style polarity and levels.
  • Figure 2: An overview of LM-Steer. (a): LM-Steer applies a linear factor $\epsilon W \mathbf{e}_v$ to each word embedding for language model conditioning. (b): During training, we use a positively steered model $P_{\epsilon W}$ to maximize likelihood on positively labeled texts, and vice versa. (c): For generation, one only needs to specify a steering value $\epsilon$, and then proceed with normal decoding.
  • Figure 3: Across base model sizes, LM-Steered GPT2 family, Pythia family, GPT-J and Llama-2-7B models (+) consistently outperform other baselines ($\square$) on detoxification. X$^\oplus$ means an LM-Steered language model X.
  • Figure 4: Continuous and compositional control using LM-Steer.
  • Figure 5: Measuring the transferability and data efficiency of LM-Steer.
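The training setup in the Figure 2 caption can be sketched as a loss function. The version below assumes the same dot-product output layer as the caption's factor $\epsilon W \mathbf{e}_v$, reads "vice versa" as using $-\epsilon$ for negatively labeled texts, and treats the context vectors as precomputed. The function names and the default `eps` value are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def log_softmax(z):
    # Row-wise log-softmax with max subtraction for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def steer_nll(contexts, targets, E, W, eps):
    """Mean negative log-likelihood of the target tokens under the steered model.

    contexts: (T, d) context vectors, one per position (assumed precomputed).
    targets:  (T,) integer ids of the tokens that actually follow.
    E:        (V, d) frozen output word embeddings.
    W:        (d, d) learnable steering matrix.
    eps:      steering value; its sign selects the style polarity.
    """
    d = E.shape[1]
    steered_E = E @ (np.eye(d) + eps * W).T  # row v becomes e_v + eps * W e_v
    logp = log_softmax(contexts @ steered_E.T)  # shape (T, V)
    return -logp[np.arange(len(targets)), targets].mean()

def training_loss(pos_batch, neg_batch, E, W, eps=1e-3):
    """Positive texts are scored with +eps, negative texts with -eps.

    The default eps is an illustrative placeholder, not a value from the paper.
    """
    pos_contexts, pos_targets = pos_batch
    neg_contexts, neg_targets = neg_batch
    return (steer_nll(pos_contexts, pos_targets, E, W, +eps)
            + steer_nll(neg_contexts, neg_targets, E, W, -eps))
```

Only `W` would be updated when minimizing this loss. At generation time the same steered logits are used with whatever $\epsilon$ the user picks, followed by normal decoding.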

Theorems & Definitions (10)

  • Theorem 1
  • Theorem 2
  • Proposition 1
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • proof