H3Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs

Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Yichang Xu, Zachary Yahn, Ling Liu

TL;DR

H3Fusion presents a mixture-of-experts (MoE) fusion method for Helpful, Harmless, and Honest alignment that combines three independently aligned LLMs into a single MoE-augmented model. It treats alignment as controllable embedding drift and introduces a drift-regularization loss and a gating loss to tune expert contributions, alongside a dual objective based on embedding-distance metrics. Across Alpaca-Eval, BeaverTails, and TruthfulQA, H3Fusion-MoE improves on each individually aligned expert (by 11.37%, per the abstract) and on LLM-ensemble and model-merging baselines (by 13.77% and 6.18%, respectively), while maintaining a small fine-tuning footprint and favorable parameter efficiency. The work highlights the practical potential of MoE-based alignment fusion for robust, multi-task alignment in LLMs, with ablations and hyperparameter analyses examining the method’s effectiveness and limitations.

Abstract

The alignment of pre-trained LLMs continues to draw significant attention from both industry and academia, aiming to ensure responses that are helpful, harmless, and honest. However, identifying a point in the model's representation subspace that simultaneously satisfies all these properties remains challenging. H3Fusion addresses this challenge by introducing a mixture-of-experts (MoE)-based fusion mechanism that models alignment as a controllable drift within the subspace, guided by a drift-regularization loss to balance competing alignment dimensions. Furthermore, we formulate alignment as a dual objective over the distance between generated embeddings and alignment embeddings, and introduce a gating loss that channels activations toward the contributing experts. Extensive evaluations on three benchmark datasets show that H3Fusion is more helpful, less harmful, and more honest in three respects: it outperforms each individually aligned model by 11.37%, and it provides stronger robustness than state-of-the-art LLM ensemble approaches by 13.77% and model-merging approaches by 6.18%. Code is available at https://github.com/sftekin/h3fusion.
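
To make the routing idea concrete, the following is a minimal NumPy sketch of top-k gating over three experts, plus the average L2 distance to a base model that the figures use as a drift measure. It is an illustration under stated assumptions, not the authors' implementation: the function names (`top_k_gate`, `fuse`, `mean_l2_drift`), the tensor shapes, and the softmax-over-top-k rule are all hypothetical choices, and the paper's gating and drift-regularization losses are not reproduced here.

```python
import numpy as np


def top_k_gate(logits, k):
    """Softmax over the k largest router logits per token; other experts get 0.

    Assumed shape: (tokens, experts). This is a generic top-k router, not
    necessarily the exact gating rule used by H3Fusion.
    """
    logits = np.asarray(logits, dtype=float)
    # Indices of the k largest logits along the expert axis.
    top_idx = np.argsort(logits, axis=-1)[..., -k:]
    # Mask every non-selected expert with -inf so it receives zero weight.
    masked = np.full_like(logits, -np.inf)
    selected = np.take_along_axis(logits, top_idx, axis=-1)
    np.put_along_axis(masked, top_idx, selected, axis=-1)
    # Numerically stable softmax over the surviving logits.
    masked -= masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)


def fuse(expert_outputs, weights):
    """Weighted sum of expert outputs.

    expert_outputs: (tokens, experts, dim); weights: (tokens, experts).
    """
    return np.einsum("te,ted->td", weights, expert_outputs)


def mean_l2_drift(hidden, base_hidden):
    """Average L2 distance between two sets of embeddings, shape (samples, dim)."""
    return float(np.linalg.norm(hidden - base_hidden, axis=-1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical setup: 4 tokens, 3 experts (helpful, harmless, honest), dim 8.
    router_logits = rng.normal(size=(4, 3))
    expert_outputs = rng.normal(size=(4, 3, 8))
    base_hidden = rng.normal(size=(4, 8))

    weights = top_k_gate(router_logits, k=2)
    fused = fuse(expert_outputs, weights)
    print("router weights per token:\n", weights)
    print("drift from base:", mean_l2_drift(fused, base_hidden))
```

With `k=2`, each token's output mixes only its two highest-scoring experts. A training objective in the spirit of the abstract would add penalty terms on the router weights and on the drift value, scaled by weights such as $\lambda$ and $\gamma$; their exact form is defined in the paper, not in this sketch.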

Paper Structure

This paper contains 31 sections, 13 equations, 7 figures, 18 tables.

Figures (7)

  • Figure 1: The main framework for H3Fusion (MoE)
  • Figure 2: The left two plots show the effect of the gating loss and the right two show the effect of the regularization loss. The plots show the average weight assigned by the router to each expert. The second plot shows how expert activity changes with the incoming dataset due to the gating loss. The fourth plot shows the regularization effect.
  • Figure 3: The first plot shows the effect of the number of fine-tuning steps during the alignment of H3Fusion. The second plot shows the performance change due to the number of experts, $k$, activated by the router. The last two plots show a sensitivity analysis: the performance change on each property as the gating loss weight $\lambda$ and the regularization weight $\gamma$ vary.
  • Figure 4: We show the hidden embeddings for 100 samples using t-SNE (van der Maaten and Hinton, 2008). Here, $d$ represents the average L2 distance to the base model.
  • Figure 5: Example prompt for H3Fusion (Instruct)
  • ...and 2 more figures