ROAST: Rollout-based On-distribution Activation Steering Technique

Xuanbo Su, Hao Luo, Yingfang Zhang, Lijun Zhang

TL;DR

ROAST steers LLM behavior at inference time through activation steering, with no fine-tuning. It introduces three components: ROC, which estimates steering directions from the model's own on-distribution rollouts; Continuous Soft Scaling (CSS), which preserves activation energy; and Grouped Mean Normalization, which stabilizes estimates across samples. Together they yield robust, scalable steering directions. Empirically, ROAST delivers consistent gains across models from $0.6\mathrm{B}$ to $32\mathrm{B}$ parameters and across diverse tasks (e.g., GSM8K, TruthfulQA), and it often rivals or exceeds few-shot prompting without using in-context demonstrations. The results highlight the importance of aligning interventions with the model's native distribution and stabilizing magnitude across samples for reliable deployment.

Abstract

Activation steering provides parameter-efficient control over large language models (LLMs) at inference time, but many methods rely on off-distribution supervision and discrete masking, leading to brittle interventions. We propose ROAST (Rollout-based On-distribution Activation Steering Technique), which estimates steering directions from the model's own on-distribution rollouts via ROC and avoids hard sparsification via Continuous Soft Scaling (CSS) and Grouped Mean Normalization. Our empirical analysis reveals that while activation magnitude correlates moderately with directional consistency, the variance in magnitude is significant and often disproportionate to semantic quality. This suggests that high-magnitude activations risk dominating the global steering direction if not properly normalized. To address this, ROAST employs grouped normalization to balance contributions across samples, ensuring a more robust estimation of the consensus steering direction. Across models (0.6B to 32B), ROAST consistently improves performance on diverse tasks (e.g., +9.7% on GSM8K for Qwen3-0.6B and +12.1% on TruthfulQA for GLM4-32B), and analyses show that CSS better preserves activation energy.
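
The normalization idea in the abstract can be illustrated with a minimal sketch. This excerpt does not give the exact formulas for CSS or Grouped Mean Normalization, so the code below is an assumption-laden stand-in rather than the authors' method: it assumes activations are grouped (for example, one group per prompt), rescales each group's contrastive difference to unit L2 norm before averaging, and adds the result to a hidden state with strength `alpha`. The function names `grouped_mean_direction` and `steer` are hypothetical.

```python
import numpy as np


def grouped_mean_direction(pos, neg, eps=1e-8):
    """Estimate a unit steering direction from grouped contrastive activations.

    pos, neg: arrays of shape (groups, rollouts, dim) holding activations for
    positive and negative rollouts. The grouping itself is an assumption.
    """
    # Contrastive difference of mean activations within each group.
    diffs = pos.mean(axis=1) - neg.mean(axis=1)            # (groups, dim)
    # Assumed normalization: rescale every group's difference to unit L2 norm
    # so that a high-magnitude group cannot dominate the average.
    norms = np.linalg.norm(diffs, axis=1, keepdims=True)
    unit_diffs = diffs / np.maximum(norms, eps)
    # Average the normalized differences and renormalize the consensus.
    direction = unit_diffs.mean(axis=0)
    return direction / max(np.linalg.norm(direction), eps)


def steer(hidden, direction, alpha):
    """Add the steering vector, scaled by alpha, to every hidden state row."""
    return hidden + alpha * direction
```

With two groups whose raw differences are `[1, 0]` and `[0, 100]`, an unnormalized mean points almost entirely along the second axis, whereas this sketch weights both groups equally. That is the failure mode the abstract attributes to high-magnitude activations.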

Paper Structure

This paper contains 51 sections, 6 equations, 12 figures, 10 tables, 1 algorithm.

Figures (12)

  • Figure 1: Quantifying the activation distribution shift between teacher forcing and rollouts in MLP activations (Qwen3-0.6B). The cosine similarity $\cos(\mu_{\text{tf}}, \mu_{\text{ar}})$ and the relative $L_2$ difference are computed between the mean activation vectors $\mu_{\text{tf}}$ (teacher forcing) and $\mu_{\text{ar}}$ (rollouts).
  • Figure 2: Sparse masking versus CSS (Qwen3-8B on SST2). Left: Top-10% masking zeros out 90% of dimensions. Right: Cumulative energy distribution shows the top 10% of dimensions capture only $\sim$50% of the total energy, leading to massive information loss in discrete masking.
  • Figure 3: Relationship between activation magnitude and directional consistency (Qwen3-8B on GSM8K, layer 16). Magnitudes correlate moderately with consistency ($\rho \approx 0.46$) but exhibit extreme variance, consistent with outlier feature observations. Unnormalized aggregation risks domination by high-magnitude samples regardless of semantic quality.
  • Figure 4: Overview of ROAST. Step 1 (ROC): Sample on-distribution rollouts, extract activations at the last token, and form contrastive positive/negative pairs. Step 2 (CSS): Aggregate contrastive activation differences and apply continuous soft scaling via normalization to obtain a steering direction. Step 3: During inference, add the resulting steering vector (scaled by strength $\alpha$) to a chosen layer $l$ and component $c$ (e.g., MLP or Attention) to steer model outputs.
  • Figure 5: Rollout stability analysis on Qwen3-0.6B. Left: Steering vectors converge as rollout count increases. Right: Layer-wise $L_2$ norms remain stable across different rollout counts.
  • ...and 7 more figures
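
The diagnostics named in the captions of Figures 1 and 2 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's evaluation code: the relative $L_2$ difference is assumed to be normalized by the norm of the teacher-forcing mean, "energy" is assumed to mean squared magnitude per dimension, and all function names are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two mean activation vectors."""
    return float(a @ b / max(np.linalg.norm(a) * np.linalg.norm(b), eps))


def relative_l2_difference(mu_tf, mu_ar, eps=1e-12):
    """||mu_tf - mu_ar|| / ||mu_tf||; the choice of denominator is an assumption."""
    return float(np.linalg.norm(mu_tf - mu_ar) / max(np.linalg.norm(mu_tf), eps))


def top_fraction_energy(v, fraction=0.10):
    """Share of total squared magnitude held by the largest `fraction` of dimensions.

    Assumes v is a nonzero 1-D vector.
    """
    energy = np.sort(v ** 2)[::-1]                  # per-dimension energy, descending
    k = max(1, int(round(fraction * v.size)))       # number of dimensions kept
    return float(energy[:k].sum() / energy.sum())
```

A value of `top_fraction_energy(v, 0.10)` near 0.5 corresponds to the situation described for Figure 2, where a top-10% mask discards roughly half of the energy.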