H3Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs
Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Yichang Xu, Zachary Yahn, Ling Liu
TL;DR
H3Fusion presents a mixture-of-experts (MoE) fusion approach to Helpful, Harmless, and Honest (H3) alignment that combines three independently aligned LLMs into a single MoE-augmented model. It treats alignment as a controllable embedding drift and introduces a drift-regularization loss and a gating loss to tune expert contributions, alongside a dual objective based on embedding distances. On Alpaca-Eval, BeaverTails, and TruthfulQA, H3Fusion-MoE substantially improves on the individual experts and on strong ensemble baselines, while keeping a small fine-tuning footprint and favorable parameter efficiency. The work highlights the practical potential of MoE-based alignment fusion for robust, multi-task alignment in LLMs, with ablations and hyperparameter analyses that examine the method’s effectiveness and limitations.
Abstract
The alignment of pre-trained LLMs continues to draw significant attention from both industry and academia, with the aim of ensuring responses that are helpful, harmless, and honest. However, identifying a point in the model's representation subspace that simultaneously satisfies all three properties remains challenging. H3Fusion addresses this challenge by introducing a mixture-of-experts (MoE)-based fusion mechanism that models alignment as a controllable drift within the subspace, guided by a drift-regularization loss that balances competing alignment dimensions. Furthermore, we formulate alignment as a dual objective over the distance between generated embeddings and alignment embeddings, and introduce a gating loss that channels activations toward the contributing experts. Extensive evaluations on three benchmark datasets show that H3Fusion is more helpful, less harmful, and more honest in three respects: it outperforms each individually aligned model by 11.37%, and it is more robust than state-of-the-art LLM ensemble approaches by 13.77% and model-merging approaches by 6.18%. Code is available at https://github.com/sftekin/h3fusion.
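To make the ingredients named in the abstract more concrete, the following is a minimal toy sketch of gated fusion over three expert embeddings with two auxiliary penalties. It is not the authors' implementation: the function names (`fuse`, `drift_penalty`, `gate_penalty`, `total_loss`), the squared-distance form of the drift term, the cost-weighted form of the gating term, and the coefficients `gamma` and `lam` are all assumptions chosen for illustration. The actual formulation is in the paper and the linked repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(expert_embeddings, gate_logits):
    """Combine expert embeddings as a softmax-weighted sum.

    expert_embeddings: one vector per expert (here: helpful, harmless, honest).
    gate_logits: one unnormalized score per expert (hypothetical router output).
    """
    weights = softmax(gate_logits)
    dim = len(expert_embeddings[0])
    fused = [
        sum(w * emb[i] for w, emb in zip(weights, expert_embeddings))
        for i in range(dim)
    ]
    return fused, weights


def drift_penalty(fused, reference):
    """Assumed drift term: squared Euclidean distance to a reference embedding."""
    return sum((f - r) ** 2 for f, r in zip(fused, reference))


def gate_penalty(weights, expert_costs):
    """Assumed gating term: cost-weighted sum of the gate weights."""
    return sum(c * w for c, w in zip(expert_costs, weights))


def total_loss(task_loss, fused, reference, weights, expert_costs,
               gamma=0.1, lam=0.1):
    """Task loss plus the two illustrative penalties (gamma, lam are assumptions)."""
    return (task_loss
            + gamma * drift_penalty(fused, reference)
            + lam * gate_penalty(weights, expert_costs))


# Toy usage: three experts with 2-d embeddings and a uniform gate.
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = fuse(experts, gate_logits=[0.0, 0.0, 0.0])
loss = total_loss(0.5, fused, reference=[1.0, 1.0], weights=weights,
                  expert_costs=[0.0, 0.0, 3.0])
```

In this sketch, raising an expert's entry in `expert_costs` discourages the gate from relying on it, while a larger `gamma` keeps the fused embedding closer to the reference; both are stand-ins for the tunable expert contributions and controllable drift that the abstract describes.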
