When the Model Said 'No Comment', We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified

Gautam Siddharth Kashyap, Mark Dras, Usman Naseem

TL;DR

This work tackles the challenge of aligning LLMs to human values across multiple objectives (helpfulness, harmlessness, and honesty) by identifying Axis Collapse, a systemic interference between objectives. It introduces AlignX, a two-stage framework: Stage 1 uses prompt-injected fine-tuning to extract axis-specific task-feature representations, mitigating catastrophic forgetting; Stage 2 employs a Mixture of Calibrated Experts (MoCaE) with fractal and natural calibrators to achieve per-instance, geometry- and semantics-aware routing and output calibration. Empirically, AlignX yields substantial improvements on Alpaca, BeaverTails, and TruthfulQA (e.g., +171.5% win rate, +110.1% in truthfulness-informativeness (TI), and 4.3% fewer safety violations) and reduces latency and memory usage by over 35% relative to prior MoE-based methods, with strong generalization across four LLM backbones. This modular, scalable approach enables safer, more trustworthy open-source LLMs and provides a blueprint for robust multi-objective alignment in practical deployments.

Abstract

Aligning Large Language Models (LLMs) with human values, namely being helpful, harmless, and honest (HHH), is important for safe deployment. Existing work uses Supervised Fine-Tuning (SFT) and Mixture-of-Experts (MoE) to align LLMs. However, these approaches face challenges in multi-objective settings: SFT leads to interference between conflicting objectives, while MoEs suffer from miscalibrated routing. We term this failure mode Axis Collapse, marked by (1) disjoint feature spaces causing catastrophic forgetting, and (2) unreliable inference from misrouted experts. To resolve this, we propose AlignX, a two-stage framework. Stage 1 uses prompt-injected fine-tuning to extract axis-specific task features, mitigating catastrophic forgetting. Stage 2 deploys a MoCaE module that calibrates expert routing using fractal and natural geometry, improving inference reliability. AlignX achieves significant gains on Alpaca (Helpfulness), BeaverTails (Harmlessness), and TruthfulQA (Honesty), with +171.5% win rate, +110.1% in truthfulness-informativeness, and 4.3% fewer safety violations. It also reduces latency and memory usage by over 35% compared to prior MoEs. Results across four LLMs validate its generalizability.
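
To make the idea of calibrated expert routing concrete, the following is a minimal sketch and not the paper's MoCaE module. It assumes three experts (one per alignment axis) and uses plain temperature scaling as a stand-in calibrator, because the fractal and natural calibrators are not specified in this summary. The names `softmax`, `route`, `gate_scores`, and `expert_outputs` are hypothetical.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax; a temperature above 1 flattens the weights."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_scores, expert_outputs, temperature=1.0):
    """Blend per-expert output vectors using (temperature-calibrated) gate weights."""
    weights = softmax(gate_scores, temperature)
    dim = len(expert_outputs[0])
    blended = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return weights, blended

# Hypothetical gate logits for the helpful, harmless, and honest experts.
gate_scores = [4.0, 1.0, 0.5]
# Hypothetical 2-dimensional expert outputs, for illustration only.
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

sharp_weights, _ = route(gate_scores, expert_outputs, temperature=1.0)
soft_weights, blended = route(gate_scores, expert_outputs, temperature=3.0)
```

With the higher temperature the gate still prefers the first expert, but an overconfident routing decision contributes less to the blended output. This is the general intuition behind calibrating a router, shown here under the stated assumptions.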

Paper Structure (21 sections, 5 figures, 11 tables)

Figures (5)

  • Figure 1: Illustration of Axis Collapse. Top: Two observed effects—catastrophic forgetting (blue) and miscalibrated expert routing (violet)—highlight the breakdown that occurs when alignment axes conflict at inference time. Bottom: In a naive setup (via LLaMA-2-7B), the helpfulness model (left, green dots) maintains clear feature boundaries. In contrast, the honesty model (right) shows collapsed structure, with green (helpfulness), red (harmlessness), and blue (honesty) points entangled—indicating interference between alignment objectives. This drift in representation space shows structural breakdown across axes, supporting the systemic nature of Axis Collapse.
  • Figure 2: Architecture of AlignX: a two-stage framework for multi-objective alignment. Stage 1 fine-tunes LLaMA-2-7B with prompt-injected datasets to compute task vectors and alignment-aware feature matrices, forming task-feature matrices. Stage 2 introduces the MoCaE module, which routes user queries to specialized experts and applies fractal and natural calibrators for geometric and semantic consistency. The final calibrated embedding is reinjected via the AlignX layer for axis-aware generation (blue: traditional, red: proposed).
  • Figure 3: Prompt injection templates used during alignment-specific fine-tuning. Each alignment axis is reinforced with a targeted helpful, harmless, or honest system prompt to steer model behavior before extracting task-feature matrices.
  • Figure 4: Ablation analysis of AlignX on LLaMA-2-7B. Panels (a–c) show the individual contributions of alignment fine-tuning and the calibrators used in MoCaE. Panels (d–f) analyze expert behaviors: (d) shows performance variation across expert configurations, (e) illustrates activation probabilities assigned to each expert during routing, and (f) reveals how incoming query types influence expert activation ("Inc" in the graph stands for "Incoming").
  • Figure 5: Ablation analysis of calibration metrics via the proposed MoCaE on LLaMA-2-7B under few-shot and zero-shot settings.
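
The Figure 2 caption mentions computing task vectors in Stage 1. As a rough illustration, the sketch below uses the common task-arithmetic definition, in which a task vector is the fine-tuned weights minus the base weights. Whether AlignX uses exactly this definition is an assumption, and the toy weight dictionaries and the function names `task_vector` and `apply_task_vectors` are hypothetical.

```python
def task_vector(base, finetuned):
    """Per-parameter difference between fine-tuned and base weights."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }

def apply_task_vectors(base, vectors, scales):
    """Add scaled task vectors back onto a copy of the base weights."""
    merged = {name: list(values) for name, values in base.items()}
    for vector, scale in zip(vectors, scales):
        for name, delta in vector.items():
            merged[name] = [m + scale * d for m, d in zip(merged[name], delta)]
    return merged

# Toy "models" with a single two-element parameter, for illustration only.
base = {"w": [1.0, 2.0]}
helpful = {"w": [1.5, 2.0]}  # hypothetical helpfulness fine-tune
honest = {"w": [1.0, 1.0]}   # hypothetical honesty fine-tune

tv_helpful = task_vector(base, helpful)
tv_honest = task_vector(base, honest)
merged = apply_task_vectors(base, [tv_helpful, tv_honest], scales=[1.0, 0.5])
```

Keeping one vector per axis, rather than fine-tuning a single model on all objectives in sequence, is one way to keep axis-specific changes separable. The scaling factors here are arbitrary.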