When the Model Said 'No Comment', We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified
Gautam Siddharth Kashyap, Mark Dras, Usman Naseem
TL;DR
This work tackles the challenge of aligning LLMs to human values across multiple objectives—helpfulness, harmlessness, and honesty—by identifying Axis Collapse, a systemic interference between objectives. It introduces AlignX, a two-stage framework: Stage 1 uses prompt-injected fine-tuning to extract axis-specific task-feature representations, mitigating catastrophic forgetting; Stage 2 employs Mixture of Calibrated Experts (MoCaE) with fractal and natural calibrators to achieve per-instance, geometry- and semantics-aware routing and output calibration. Empirically, AlignX yields substantial improvements on Alpaca, BeaverTails, and TruthfulQA (e.g., +171.5% win rate, +110.1% TI, and 4.3% fewer safety violations) and reduces latency and memory usage by over 35% relative to prior MoE-based methods, with strong generalization across four LLM backbones. This modular, scalable approach enables safer, more trustworthy open-source LLMs and provides a blueprint for robust multi-objective alignment in practical deployments.
Abstract
Large Language Models (LLMs) need to be in accordance with human values-being helpful, harmless, and honest (HHH)-is important for safe deployment. Existing works use Supervised Fine-Tuning (SFT) and Mixture-of-Experts (MoE) to align LLMs. However, these works face challenges in multi-objective settings, such as SFT leading to interference between conflicting objectives, while MoEs suffer from miscalibrated routing. We term this failure mode Axis Collapse, marked by (1) disjoint feature spaces causing catastrophic forgetting, and (2) unreliable inference from misrouted experts. To resolve this, we propose AlignX, a two-stage framework. Stage 1 uses prompt-injected fine-tuning to extract axis-specific task features, mitigating catastrophic forgetting. Stage 2 deploys a MoCaE module that calibrates expert routing using fractal and natural geometry, improving inference reliability. AlignX achieves significant gains on Alpaca (Helpfulness), BeaverTails (Harmlessness), and TruthfulQA (Honesty), with +171.5% win rate, +110.1% in truthfulness-informativeness, and 4.3% fewer safety violations. It also reduces latency and memory usage by over 35% compared to prior MoEs. Results across four LLMs validate its generalizability.
