FaithSteer-BENCH: A Deployment-Aligned Stress-Testing Benchmark for Inference-Time Steering

Zikang Ding; Qiying Hu; Yi Zhang; Hongji Li; Junchi Yao; Hongbo Liu; Lijie Hu

Abstract

Inference-time steering is widely regarded as a lightweight and parameter-free mechanism for controlling large language model (LLM) behavior, and prior work has often suggested that simple activation-level interventions can reliably induce targeted behavioral changes. However, such conclusions are typically drawn under relatively relaxed evaluation settings that overlook deployment constraints, capability trade-offs, and real-world robustness. We therefore introduce FaithSteer-BENCH, a stress-testing benchmark that evaluates steering methods at a fixed deployment-style operating point through three gate-wise criteria: controllability, utility preservation, and robustness. Across multiple models and representative steering approaches, we uncover several systematic failure modes that are largely obscured under standard evaluation, including illusory controllability, measurable cognitive tax on unrelated capabilities, and substantial brittleness under mild instruction-level perturbations, role prompts, encoding transformations, and data scarcity. Gate-wise benchmark results show that existing methods do not necessarily provide reliable controllability in deployment-oriented practical settings. In addition, mechanism-level diagnostics indicate that many steering methods induce prompt-conditional alignment rather than stable latent directional shifts, further explaining their fragility under stress. FaithSteer-BENCH therefore provides a unified benchmark and a clearer analytical lens for future method design, reliability evaluation, and deployment-oriented research in steering.

Paper Structure (70 sections, 19 equations, 10 figures, 20 tables)

Introduction
Related Works
Inference-Time Steering Methods
Benchmarks for Steering and Steerability
FaithSteer-BENCH
Standardized Steering Interface and Deployment Constraint
Stress Taxonomy
Red Teaming Stress.
OOD Stress.
Hybrid Stress.
Evaluation Axes and Metrics
Reference Operating Point
Calibration splits and aggregation.
Selecting $\alpha^*(\mathcal{S})$ under deployment constraints.
Stability preference.
...and 55 more sections

Figures (10)

Figure 1: Steering evaluation landscape and the deployment-reliability gap addressed by FaithSteer-BENCH.
Figure 2: Overview of FaithSteer-BENCH. Steering methods are evaluated through three stages: clean controllability, capability preservation, and robustness under stress. The results are then converted into gate-wise deployment verdicts.
Figure 3: General capability under steering across four base models. Bars show the average capability score over RACE, MMLU, OpenBookQA, and GLUE, further averaged over the eight steering tasks. Higher is better.
Figure 4: Mechanism-level diagnostics for three representative cases.
Figure 5: Radar plots of steering performance (ACC and APC) on eight datasets. Columns denote different backbone models and rows correspond to ACC (top) and APC (bottom). Each plot compares Base (no steering) with CAA, PCA, TopPC, and ITI. Larger areas indicate better steering performance.
...and 5 more figures
