We Think, Therefore We Align LLMs to Helpful, Harmless and Honest Before They Go Wrong

Gautam Siddharth Kashyap, Mark Dras, Usman Naseem

TL;DR

This work addresses the problem of aligning large language models along multiple objectives (helpfulness, harmlessness, and honesty, or HHH) without catastrophic forgetting or inference fragmentation. It introduces Adaptive Multi-Branch Steering (AMBS), a two-stage 1-to-N framework that computes a shared base representation once for all axes (Stage I) and then applies axis-specific steering through a policy–reference mechanism (Stage II), guided by a cosine-based objective. Experiments on Alpaca, BeaverTails, and TruthfulQA with 7B backbones such as DeepSeek-7B show that AMBS improves multi-axis HHH alignment, reduces unsafe outputs, and maintains cross-axis consistency, although results remain sensitive to the choice of backbone. Gains reach +32.4% in average alignment score on DeepSeek-7B relative to a naive 1-to-N baseline. The findings are supported by ablations, generalization tests, and a small human evaluation, which highlight both the method's promise and the need for further work on scaling and robustness.

Abstract

Alignment of Large Language Models (LLMs) along multiple objectives (helpfulness, harmlessness, and honesty, or HHH) is critical for safe and reliable deployment. Prior work has used steering vectors (small control signals injected into hidden states) to guide LLM outputs, typically via one-to-one (1-to-1) Transformer decoders. In this setting, optimizing a single alignment objective can inadvertently overwrite representations learned for other objectives, leading to catastrophic forgetting. More recent approaches extend steering vectors via one-to-many (1-to-N) Transformer decoders. While this alleviates catastrophic forgetting, naive multi-branch designs optimize each objective independently, which can cause inference fragmentation: outputs across HHH objectives may become inconsistent. We propose Adaptive Multi-Branch Steering (AMBS), a two-stage 1-to-N framework for unified and efficient multi-objective alignment. In Stage I, post-attention hidden states of the Transformer layer are computed once to form a shared representation. In Stage II, this representation is cloned into parallel branches and steered via a policy-reference mechanism, enabling objective-specific control while maintaining cross-objective consistency. Empirical evaluations on Alpaca, BeaverTails, and TruthfulQA show that AMBS consistently improves HHH alignment across multiple 7B LLM backbones. For example, on DeepSeek-7B, AMBS improves average alignment scores by +32.4% and reduces unsafe outputs by 11.0% compared to a naive 1-to-N baseline, while remaining competitive with state-of-the-art methods.
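
The two-stage flow described in the abstract can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: `ToyLayer`, `ambs_like_forward`, the additive update h' = h + alpha * v, and the random steering vectors are all hypothetical stand-ins, and the policy-reference mechanism and the cosine-based training objective are omitted because the abstract does not specify them.

```python
import math
import random

AXES = ("helpfulness", "harmlessness", "honesty")


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyLayer:
    """Hypothetical stand-in for one Transformer layer; counts evaluations."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.calls = 0
        self._weights = [rng.gauss(0.0, 1.0) for _ in range(dim)]

    def post_attention(self, token_embedding):
        # Placeholder computation (elementwise scaling), NOT real attention.
        self.calls += 1
        return [w * x for w, x in zip(self._weights, token_embedding)]


def ambs_like_forward(layer, token_embedding, steering_vectors, alpha=1.0):
    # Stage I: compute the shared post-attention representation once.
    shared = layer.post_attention(token_embedding)
    # Stage II: clone the shared state per axis and steer each clone.
    # The additive form h' = h + alpha * v is an assumption of this sketch.
    branches = {}
    for axis in AXES:
        clone = list(shared)
        branches[axis] = [h + alpha * v
                          for h, v in zip(clone, steering_vectors[axis])]
    return shared, branches


rng = random.Random(42)
dim = 16
layer = ToyLayer(dim)
embedding = [rng.gauss(0.0, 1.0) for _ in range(dim)]
vectors = {axis: [rng.gauss(0.0, 0.1) for _ in range(dim)] for axis in AXES}

shared, branches = ambs_like_forward(layer, embedding, vectors, alpha=1.0)
print("layer evaluations:", layer.calls)
for axis in AXES:
    print(axis, "cosine to shared state:",
          round(cosine(shared, branches[axis]), 3))
```

The point of the sketch is structural: the expensive step runs once regardless of the number of axes, and every branch starts from the same state, which is the property the abstract credits with avoiding redundant computation and limiting divergence between branches.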

Paper Structure

This paper contains 27 sections, 3 equations, 6 figures, 7 tables, 1 algorithm.

Figures (6)

  • Figure 1: Motivation for AMBS in HHH alignment. Left (Qualitative): A shared user prompt is processed by a 1-to-N Transformer. Naïve multi-branch decoding produces inconsistent outputs across objectives: the helpfulness branch yields vague and non-actionable text, the harmlessness branch produces unsafe advice, and the honesty branch generates factually false content. In contrast, AMBS produces coordinated responses that are simultaneously HHH. Right (Quantitative): t-SNE visualization of post-attention hidden states from LLaMA-2-7B (last layer, perplexity=25, seed=42). Naïve 1-to-N branches diverge into disjoint clusters, illustrating inference fragmentation. AMBS branches overlap substantially, indicating that adaptive steering preserves coordinated hidden representations across HHH objectives.
  • Figure 2: Overview of Adaptive Multi-Branch Steering (AMBS) via a 1-to-N Transformer. Stage I computes shared post-attention hidden states once, providing a common representation for all objectives. Stage II clones these states into parallel branches, injects branch-specific steering vectors, and applies policy–reference updates to produce outputs aligned along HHH simultaneously and efficiently. This design avoids redundant computation, prevents catastrophic forgetting, and mitigates inference fragmentation.
  • Figure 3: Hidden state update verification per steering axis via LLaMA-2-7B. Top: Norm before vs. after steering. Bottom: Cosine similarity with target vector and $\Delta$ alignment scores (WR, SS, TI).
  • Figure 4: Effect of steering layer ($\ell$) on LLaMA-2-7B.
  • Figure 5: Effect of steering magnitude $\alpha$ (LLaMA-2-7B, $\ell=32$). Left: overall trends (WR$\uparrow$, SS$\downarrow$, TI$\uparrow$, Avg$\uparrow$). Right: per-axis breakdown (HHH). Moderate steering ($\alpha=1.0$) achieves the best balance, while magnitudes that are too weak ($\alpha=0.25,0.5$) or too strong ($\alpha=2.0$) reduce Avg due to under- or over-steering.
  • ...and 1 more figure
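
The diagnostics named in the Figure 3 and Figure 5 captions (hidden-state norm before and after steering, cosine similarity with a target vector, and a sweep over the steering magnitude $\alpha$) can be illustrated with a small stdlib-only sketch. Everything here is an assumption made for illustration: the additive update h' = h + alpha * t, the helper names `steer` and `sweep`, and the random vectors standing in for real hidden states. It does not reproduce the paper's WR, SS, or TI scores.

```python
import math
import random


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def steer(hidden, target, alpha):
    """Additive steering h' = h + alpha * t (an assumed update rule)."""
    return [h + alpha * t for h, t in zip(hidden, target)]


def sweep(hidden, target, alphas):
    """Return (alpha, norm_before, norm_after, cosine_to_target) per alpha."""
    rows = []
    for alpha in alphas:
        steered = steer(hidden, target, alpha)
        rows.append((alpha, norm(hidden), norm(steered),
                     cosine(steered, target)))
    return rows


# Random stand-ins for a hidden state and a target steering direction.
rng = random.Random(42)
hidden = [rng.gauss(0.0, 1.0) for _ in range(32)]
target = [rng.gauss(0.0, 1.0) for _ in range(32)]

# Same alpha grid as the Figure 5 caption.
rows = sweep(hidden, target, alphas=(0.25, 0.5, 1.0, 2.0))
for alpha, before, after, cos in rows:
    print(f"alpha={alpha:<4} norm {before:.2f} -> {after:.2f} "
          f"cos(target)={cos:.3f}")
```

Under this additive rule the cosine to the target can only grow as $\alpha$ grows, so the toy cannot show the drop in Avg that the Figure 5 caption reports at $\alpha=2.0$; that effect is measured on downstream alignment scores, not on the geometry alone.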