SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning

Wei Xia, Zhi-Hong Deng

TL;DR

SDA addresses the core challenge of aligning open-source LLMs to human intent without retraining. It introduces a three-stage inference-time pipeline (score-guided amplification, steering-based logit realignment, and divergence-aware temperature scaling) that redistributes output probabilities toward user-aligned behavior. Empirical results across eight open LLMs and five datasets show substantial improvements in helpfulness (avg +64.4%), honesty (avg +30%), and harmlessness (avg +11.5%), outperforming a training-based baseline while requiring no weight updates. SDA's lightweight, model-agnostic approach enables personalized alignment and easy integration with existing workflows, though it relies on external scoring and is currently tailored to open models that expose log-probability outputs. The work suggests broad applicability and potential synergy with training-time methods for robust, scalable alignment in real-world deployments.

Abstract

With the rapid advancement of large language models (LLMs), their deployment in real-world applications has become increasingly widespread. LLMs are expected to deliver robust performance across diverse tasks, user preferences, and practical scenarios. However, as demands grow, ensuring that LLMs produce responses aligned with human intent remains a foundational challenge. In particular, aligning model behavior effectively and efficiently during inference, without costly retraining or extensive supervision, is both a critical requirement and a non-trivial technical endeavor. To address this challenge, we propose SDA (Steering-Driven Distribution Alignment), a training-free and model-agnostic alignment framework designed for open-source LLMs. SDA dynamically redistributes model output probabilities based on user-defined alignment instructions, enhancing alignment between model behavior and human intent without fine-tuning. The method is lightweight, resource-efficient, and compatible with a wide range of open-source LLMs. It can function independently during inference or be integrated with training-based alignment strategies. Moreover, SDA supports personalized preference alignment, enabling flexible control over the model's response behavior. Empirical results demonstrate that SDA consistently improves alignment performance across 8 open-source LLMs of varying scales and diverse origins, evaluated on three key alignment dimensions: helpfulness, harmlessness, and honesty (3H). Specifically, SDA achieves average gains of 64.4% in helpfulness, 30% in honesty, and 11.5% in harmlessness across the tested models, indicating its effectiveness and generalization across diverse models and application scenarios.

Paper Structure

This paper contains 42 sections, 13 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Overview of the SDA framework, designed to redistribute the output probabilities of the model, $\boldsymbol{P}(x_t | \mathcal{Q}, \mathcal{I})$, based on query $\mathcal{Q}$ and alignment instruction $\mathcal{I}$. Given a user query $\mathcal{Q}$, SDA first samples an initial response from the base LLM and obtains an alignment score $S$ ($0 < S \leq 100$) for that response using an external evaluator (such as a stronger LLM). Next, SDA converts the score into an amplifying factor $a$ via a smooth sigmoid-based transformation. Finally, SDA performs token-level steering to adjust the output distribution of the base LLM with amplifying factor $a$, while dynamically calibrating the sampling temperature $T$ based on JS divergence, enhancing alignment between model behavior and human intent.
  • Figure 2: Illustration of SDA on Output Distribution.
  • Figure 3: Illustration of Alignment on Output Distribution. The left side shows the original unaligned distribution, while the right side illustrates the aligned distribution produced by any alignment strategy.
  • Figure 4: Illustration of Function $F(S)$. The function maps the alignment score $S$ to a steering factor $a$, which is used to adjust the logits of the output distribution. The function is designed to be sensitive to the alignment score, with lower scores leading to stronger steering factors.
  • Figure 5: Illustration of Temperature Scaling. The function computes the Jensen-Shannon divergence (JS divergence) between the original output distribution and the instruction-aligned output distribution to determine the intensity of the adjustment to the temperature.
  • ...and 5 more figures
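
The captions for Figures 1, 4, and 5 describe three steps: mapping a score $S$ to a factor $a$, steering the token distribution with $a$, and calibrating the temperature $T$ from the JS divergence. The following is a minimal Python sketch of how those steps could fit together. It is not the authors' implementation: the captions do not give the formulas, so the sigmoid constants (`a_max`, `midpoint`, `steepness`), the contrastive steering rule in `steer_logits`, and the rule in `calibrated_temperature` (higher divergence gives a lower temperature) are all assumptions chosen for illustration. Only the JS divergence itself follows its standard definition.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities at the given sampling temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def amplifying_factor(score, a_max=2.0, midpoint=50.0, steepness=0.1):
    """Hypothetical sigmoid map from a score S in (0, 100] to a factor a.

    Lower scores give larger factors, matching the Figure 4 caption.
    The constants are illustrative, not taken from the paper.
    """
    if not 0 < score <= 100:
        raise ValueError("score must satisfy 0 < S <= 100")
    return a_max / (1.0 + math.exp(steepness * (score - midpoint)))


def steer_logits(base_logits, instructed_logits, a):
    """Assumed steering rule: extrapolate along (instructed - base).

    base_logits come from the query alone, instructed_logits from the
    query plus the alignment instruction. With a = 0 the instructed
    logits are returned unchanged.
    """
    return [i + a * (i - b) for b, i in zip(base_logits, instructed_logits)]


def kl_divergence(p, q):
    """KL(p || q) in nats, skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in nats (bounded above by ln 2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def calibrated_temperature(js, base_temperature=1.0, t_min=0.1):
    """Assumed rule: the larger the divergence, the lower the temperature."""
    ratio = min(js / math.log(2), 1.0)
    return max(t_min, base_temperature * (1.0 - ratio))


def demo():
    # Toy 4-token vocabulary; the logits are made up for illustration.
    base = [2.0, 1.0, 0.5, 0.1]        # logits given the query only
    instructed = [1.0, 2.5, 0.5, 0.1]  # logits given query + instruction
    a = amplifying_factor(score=30)    # a low score yields a strong factor
    steered = steer_logits(base, instructed, a)
    js = js_divergence(softmax(base), softmax(instructed))
    temperature = calibrated_temperature(js)
    return softmax(steered, temperature)


if __name__ == "__main__":
    print(demo())
```

In this sketch a response that scores poorly yields a larger `a`, so the steered distribution moves further from the base distribution toward the instruction-conditioned one; a response that already scores well is changed much less.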