We Think, Therefore We Align LLMs to Helpful, Harmless and Honest Before They Go Wrong
Gautam Siddharth Kashyap, Mark Dras, Usman Naseem
TL;DR
This work tackles the problem of aligning large language models along multiple objectives—helpfulness, harmlessness, and honesty (HHH)—without suffering from catastrophic forgetting or inference fragmentation. It introduces Adaptive Multi-Branch Steering (AMBS), a two-stage 1-to-N framework that shares a base representation across axes (Stage I) and applies axis-specific steering through a policy–reference mechanism (Stage II), guided by a cosine-based objective. Empirical results on Alpaca, BeaverTails, TruthfulQA, and backbones like DeepSeek-7B show that AMBS improves multi-axis HHH alignment, reduces unsafe outputs, and maintains cross-axis consistency, though backbone sensitivity remains a factor. The method achieves notable gains (e.g., up to +32.4% Avg on certain backbones) while reducing inference fragmentation, indicating practical potential for safer, more reliable multi-objective LLM deployment. The findings are supported by ablations, generalization tests, and a small human evaluation, highlighting both the promise and areas for further refinement in scaling and robustness.
Abstract
Alignment of Large Language Models (LLMs) along multiple objectives-helpfulness, harmlessness, and honesty (HHH)-is critical for safe and reliable deployment. Prior work has used steering vector-small control signals injected into hidden states-to guide LLM outputs, typically via one-to-one (1-to-1) Transformer decoders. In this setting, optimizing a single alignment objective can inadvertently overwrite representations learned for other objectives, leading to catastrophic forgetting. More recent approaches extend steering vectors via one-to-many (1-to-N) Transformer decoders. While this alleviates catastrophic forgetting, naive multi-branch designs optimize each objective independently, which can cause inference fragmentation-outputs across HHH objectives may become inconsistent. We propose Adaptive Multi-Branch Steering (AMBS), a two-stage 1-to-N framework for unified and efficient multi-objective alignment. In Stage I, post-attention hidden states of the Transformer layer are computed once to form a shared representation. In Stage II, this representation is cloned into parallel branches and steered via a policy-reference mechanism, enabling objective-specific control while maintaining cross-objective consistency. Empirical evaluations on Alpaca, BeaverTails, and TruthfulQA show that AMBS consistently improves HHH alignment across multiple 7B LLM backbones. For example, on DeepSeek-7B, AMBS improves average alignment scores by +32.4% and reduces unsafe outputs by 11.0% compared to a naive 1-to-N baseline, while remaining competitive with state-of-the-art methods.
