COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics

Kartik Sharma, Rakshit S. Trivedi

TL;DR

COLD-Steer is a training-free framework that steers LLM activations by approximating the representational changes that gradient descent on in-context examples would produce, opening new possibilities for adaptive, context-aware model control.

Abstract

Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches face a fundamental trade-off: sample-efficient methods capture steering signals from labeled examples suboptimally, while methods that extract these signals better require hundreds to thousands of examples. We introduce COLD-Steer, a training-free framework that steers LLM activations by approximating the representational changes that would result from gradient descent on in-context examples. Our key insight is that the effect of fine-tuning on a small set of examples can be efficiently approximated at inference time without actual parameter updates. We formalize this through two complementary approaches: (i) a unit kernel approximation method that updates the activations directly using gradients with respect to them, normalized across examples, and (ii) a finite-difference approximation requiring only two forward passes regardless of example count. Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95% steering effectiveness while using 50 times fewer samples than the best baseline. COLD-Steer makes it possible to accommodate diverse perspectives without extensive demonstration data, which we validate through experiments on pluralistic alignment tasks. Our framework opens new possibilities for adaptive, context-aware model control that can flexibly address varying loss-driven human preferences through principled approximation of learning dynamics rather than specialized training procedures.
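The finite-difference idea in the abstract can be illustrated on a toy model. The sketch below is not the authors' implementation: the one-layer `tanh` network, the squared-error loss, and every name (`hidden`, `loss_grad`, `fd_activation_shift`, `eps`, `lr`) are assumptions made for illustration. It estimates how a hidden activation for a new input would move under one gradient-descent step on a few examples, using two forward passes on the new input and no parameter update.

```python
import numpy as np

def hidden(W, x):
    """Toy activation Z(x; W) = tanh(W @ x) (a stand-in for an LLM layer)."""
    return np.tanh(W @ x)

def loss_grad(W, v, xs, ys):
    """Gradient w.r.t. W of the mean of 0.5 * (v . Z(x) - y)^2 over the examples."""
    g = np.zeros_like(W)
    for x, y in zip(xs, ys):
        z = hidden(W, x)
        err = v @ z - y                           # scalar residual
        g += err * np.outer(v * (1.0 - z**2), x)  # chain rule through tanh
    return g / len(xs)

def fd_activation_shift(W, g, x, lr, eps=1e-6):
    """Finite-difference estimate of the activation change for a step -lr * g.

    Uses two forward passes on x: one at W, one at W - eps * g.
    """
    z0 = hidden(W, x)
    z1 = hidden(W - eps * g, x)
    return (lr / eps) * (z1 - z0)

# Hypothetical data: 5 "in-context" examples and one new prompt.
rng = np.random.default_rng(0)
W = 0.5 * rng.normal(size=(4, 3))
v = rng.normal(size=4)
xs = rng.normal(size=(5, 3))
ys = np.ones(5)
x_new = rng.normal(size=3)

lr = 0.1
g = loss_grad(W, v, xs, ys)
delta_fd = fd_activation_shift(W, g, x_new, lr)
steered = hidden(W, x_new) + delta_fd             # steered activation; W is unchanged
```

In this toy setting the estimate matches the first-order change `-lr * (1 - Z**2) * (g @ x_new)`; the cost of evaluating the shift for a new input does not grow with the number of examples once `g` is available.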

Paper Structure (32 sections, 2 theorems, 5 equations, 5 figures, 18 tables)

This paper contains 32 sections, 2 theorems, 5 equations, 5 figures, 18 tables.

Key Result

Corollary 1

DiffMean, or difference of means [panickssery2023caa], is equivalent to $\Delta \mathbf{Z}^{(\kappa)}(\mathbf{x}; \theta)$ with the loss function $\mathcal{L}(\mathcal{M}(\tilde{\mathbf{x}}_i), \tilde{\mathbf{y}}_i) = - \sum_{i}\lVert\mathbf{Z}(\ldots$ …
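For reference, the DiffMean baseline named in the corollary is commonly computed as the mean activation over examples of the desired behavior minus the mean activation over examples of the opposite behavior. The sketch below shows only that baseline, not the corollary's loss (which is truncated above); the arrays, function names, and the scale `alpha` are hypothetical.

```python
import numpy as np

def diffmean_vector(pos_acts, neg_acts):
    """Difference of mean activations; inputs have shape (n_examples, hidden_dim)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(activation, vector, alpha=1.0):
    """Add the scaled steering vector to an activation."""
    return activation + alpha * vector

# Hypothetical 2-dimensional activations for two positive and two negative examples.
pos = np.array([[1.0, 2.0], [3.0, 4.0]])
neg = np.array([[0.0, 0.0], [2.0, 2.0]])

v = diffmean_vector(pos, neg)                     # [1.0, 2.0]
h_steered = steer(np.array([0.5, 0.5]), v, alpha=2.0)  # [2.5, 4.5]
```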

Figures (5)

  • Figure 1: Comparison of steering methods based on their efficiency and steerability. The adjoining figure shows a representative trend for steering accuracy versus number of samples.
  • Figure 2: Steering with in-Context One-step Learning Dynamics: given in-context examples of the desired behavior, we steer an activation $\mathbf{Z}$ for a new prompt $\mathbf{x}$ by approximately the amount it would change if the model's parameters were moved in the direction of the gradient of a loss function over the examples. In particular, we use the finite-difference (FD) and kernel approximations.
  • Figure 3: Steering accuracy of Llama-2-7b-hf on the CAA dataset for varying number of examples.
  • Figure 4: Mean judge scores (out of 10) for generations on the CAA dataset (standard deviation $\le 0.5$).
  • Figure 5: Accuracy of the desired behavior on the CAA dataset, compared with the contrastive steering vector (DiffMean), for a varying number of samples describing the behavior.

Theorems & Definitions (2)

  • Corollary 1
  • Corollary 2