Mind the Performance Gap: Capability-Behavior Trade-offs in Feature Steering

Eitan Sprejer, Oscar Agustín Stanchi, María Victoria Carro, Denise Alejandra Mester, Iván Arcuschin

TL;DR

The paper investigates feature steering as a mechanistic approach to control LLM behavior by editing internal representations and benchmarks Goodfire's Auto Steer against simple prompting across 14 steering queries on 171 MMLU items using Llama-8B and Llama-70B. It finds that Auto Steer increases behavioral control but causes substantial declines in accuracy and coherence, whereas simple prompting preserves task performance and coherence, offering the best practical balance. The results reveal a fundamental capability-behavior trade-off in current feature steering methods, challenging their deployment viability for real-world tasks where both control and accuracy matter. An open-source evaluation framework is released to enable broader, empirical comparisons across steering methods, strengths, and tasks.

Abstract

Feature steering has emerged as a promising approach for controlling LLM behavior through direct manipulation of internal representations, offering advantages over prompt engineering. However, its practical effectiveness in real-world applications remains poorly understood, particularly regarding potential trade-offs with output quality. We show that feature steering methods substantially degrade model performance even when they successfully control target behaviors. Specifically, we evaluate Goodfire's Auto Steer against prompt engineering baselines across 14 steering queries (covering innocuous and safety-relevant behaviors) on 171 Massive Multitask Language Understanding (MMLU) questions using Llama-8B and Llama-70B, measuring accuracy, coherence, and behavioral control. Our findings show that Auto Steer successfully modifies target behaviors (achieving behavior scores of 3.33 vs. 2.98 for prompting on Llama-8B and 3.57 vs. 3.10 on Llama-70B), but causes dramatic performance degradation: accuracy on the MMLU questions drops from 66% to 46% on Llama-8B and from 87% to 73% on Llama-70B, with coherence falling from 4.62 to 2.24 and from 4.94 to 3.89, respectively. Simple prompting achieves the best overall balance. These findings highlight limitations of current feature steering methods for practical deployment where task performance cannot be sacrificed. More broadly, our work demonstrates that mechanistic control methods face fundamental capability-behavior trade-offs that must be empirically characterized before deployment.
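
To make the size of the trade-off concrete, the figures quoted in the abstract can be turned into absolute and relative changes. The snippet below is a minimal sketch that uses only the numbers stated above; the names `REPORTED`, `BEHAVIOR`, `drop`, and `summarize` are illustrative and not part of the authors' released framework. The abstract does not say which condition the "from" values correspond to, so they are labeled simply as reference values.

```python
# Values copied from the abstract: (reference value, value under Auto Steer).
REPORTED = {
    "Llama-8B": {"accuracy_pct": (66.0, 46.0), "coherence": (4.62, 2.24)},
    "Llama-70B": {"accuracy_pct": (87.0, 73.0), "coherence": (4.94, 3.89)},
}

# Behavior scores from the abstract: (prompting, Auto Steer).
BEHAVIOR = {"Llama-8B": (2.98, 3.33), "Llama-70B": (3.10, 3.57)}


def drop(before, after):
    """Return (absolute drop, drop as a fraction of the starting value)."""
    absolute = before - after
    return absolute, absolute / before


def summarize():
    """Collect the behavior gain and the accuracy/coherence costs per model."""
    rows = {}
    for model, metrics in REPORTED.items():
        acc_abs, acc_rel = drop(*metrics["accuracy_pct"])
        coh_abs, coh_rel = drop(*metrics["coherence"])
        prompting, auto_steer = BEHAVIOR[model]
        rows[model] = {
            "behavior_gain": round(auto_steer - prompting, 2),
            "accuracy_drop_points": round(acc_abs, 2),
            "accuracy_drop_relative": round(acc_rel, 3),
            "coherence_drop": round(coh_abs, 2),
            "coherence_drop_relative": round(coh_rel, 3),
        }
    return rows


if __name__ == "__main__":
    for model, row in summarize().items():
        print(model, row)
```

Read this way, a behavior-score gain of 0.35 on Llama-8B comes with a 20-point accuracy drop (about 30% of the starting accuracy), while the 0.47 gain on Llama-70B comes with a 14-point drop (about 16%).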

Paper Structure (16 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Distribution of behavior and coherence scores across steering methods for Llama-8B. Violin plots show the full distribution of scores, with white dots indicating medians and thick black bars showing interquartile ranges.
  • Figure 2: Distribution of behavior and coherence scores across steering methods for Llama-70B. Violin plots show the full distribution of scores, with white dots indicating medians and thick black bars showing interquartile ranges.
  • Figure 3: Behavior, coherence, and accuracy scores by steering query and method for Llama-8B. Left: Behavior scores show strong steering effects for emotional/creative queries. Center: Coherence scores reveal severe degradation under Auto Steer for these same queries. Right: Accuracy scores demonstrate the performance costs of feature steering.
  • Figure 4: Behavior, coherence, and accuracy scores by steering query and method for Llama-70B. The larger model shows similar patterns to Llama-8B but with attenuated degradation effects.
  • Figure 5: Relationship between coherence and accuracy across all responses. Points represent mean accuracy within coherence bins.
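
The binning described in the Figure 5 caption (mean accuracy within coherence bins) can be sketched as follows. This is an assumption-laden illustration, not the authors' plotting code: the function name `mean_accuracy_by_coherence_bin`, the bin width of 1.0, and the toy records are all hypothetical, and the toy values are made up for demonstration rather than taken from the paper.

```python
from collections import defaultdict


def mean_accuracy_by_coherence_bin(records, bin_width=1.0):
    """Group (coherence, is_correct) pairs into bins and average correctness.

    records   -- iterable of (coherence_score, is_correct) pairs
    bin_width -- width of each coherence bin (assumed; not given in the source)
    Returns a dict mapping the lower edge of each bin to its mean accuracy.
    """
    buckets = defaultdict(list)
    for coherence, is_correct in records:
        # Floor the score to the lower edge of its bin.
        bin_start = (coherence // bin_width) * bin_width
        buckets[bin_start].append(1.0 if is_correct else 0.0)
    return {edge: sum(vals) / len(vals) for edge, vals in sorted(buckets.items())}


if __name__ == "__main__":
    # Toy, made-up responses: (coherence score, answered correctly?)
    toy_records = [
        (1.2, False), (1.8, False),
        (2.5, True), (2.9, False),
        (4.1, True), (4.7, True),
    ]
    print(mean_accuracy_by_coherence_bin(toy_records))
```

Each key in the result would correspond to one point in a plot like Figure 5, with the bin on the horizontal axis and mean accuracy on the vertical axis.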