Mixture of insighTful Experts (MoTE): The Synergy of Thought Chains and Expert Mixtures in Self-Alignment
Zhili Liu, Yunhao Gou, Kai Chen, Lanqing Hong, Jiahui Gao, Fei Mi, Yu Zhang, Zhenguo Li, Xin Jiang, Qun Liu, James T. Kwok
TL;DR
MoTE tackles the challenge of aligning LLMs with human values by integrating structured, multi-step reasoning with a step-level mixture-of-experts (MoE) architecture. It introduces a four-step reasoning chain (Question Analysis, Answer Guidance, Safe Answer, and Safety Checking) and pairs it with parallel LoRA experts, one dedicated to each step, plus a shared expert that promotes cross-step collaboration. This design enables adaptive inference lengths and improves safety even for smaller models. Experiments on 7B and 8B models show substantial gains in safety and jailbreak resistance, with performance competitive with OpenAI's o1 model on key benchmarks, and comprehensive ablations highlight the contributions of data, step-level routing, and step skipping. The work provides both empirical evidence and a theoretical rationale for decomposing alignment into intermediate steps and demonstrates a scalable, training-efficient path toward robust self-alignment in LLMs. Overall, MoTE advances practical self-alignment by fusing reasoning-driven safety with parameter-efficient MoE, offering a foundation for extending safer AI across domains and modalities.
Abstract
As the capabilities of large language models (LLMs) continue to expand, aligning these models with human values remains a significant challenge. Recent studies show that reasoning abilities contribute significantly to model safety, while integrating Mixture-of-Experts (MoE) architectures can further enhance alignment. In this work, we address a fundamental question: how can reasoning abilities and MoE architectures be effectively incorporated into the self-alignment process of LLMs? We propose Mixture of insighTful Experts (MoTE), a novel framework that synergistically combines reasoning chains and expert mixtures to improve self-alignment. From a data perspective, MoTE employs a structured reasoning chain comprising four key stages: Question Analysis, Answer Guidance, Safe Answer, and Safety Checking. This approach enhances safety through multi-step reasoning and proves effective even for smaller and less powerful LLMs (e.g., 7B models). From an architectural perspective, MoTE adopts a multi-LoRA framework with step-level routing, where each expert is dedicated to a specific reasoning step. This design eliminates the need for balance losses, ensures stable training, and supports adaptive inference lengths. Experimental results demonstrate that MoTE significantly improves model safety, jailbreak resistance, and the handling of over-refusal, achieving performance comparable to OpenAI's state-of-the-art o1 model.
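
To make the idea of step-level routing concrete, the following is a minimal NumPy sketch of a linear layer with one LoRA expert per reasoning step plus a shared expert. It is an illustration under stated assumptions, not the authors' implementation: the class names (`LoRAExpert`, `StepRoutedLoRALinear`), the additive combination of the shared and step experts, the rank, and the initialization are all hypothetical choices. The point it illustrates is that the expert is selected by the reasoning step a token belongs to rather than by a learned gate, which is why no load-balancing loss is involved.

```python
import numpy as np

# The four reasoning steps named in the abstract.
STEPS = ["question_analysis", "answer_guidance", "safe_answer", "safety_checking"]


class LoRAExpert:
    """Low-rank update: delta(x) = (x @ A) @ B * (alpha / rank)."""

    def __init__(self, d_in, d_out, rank, rng, alpha=1.0):
        self.A = rng.normal(0.0, 0.02, size=(d_in, rank))
        self.B = np.zeros((rank, d_out))  # common LoRA init: the update starts at zero
        self.scale = alpha / rank

    def __call__(self, x):
        return (x @ self.A) @ self.B * self.scale


class StepRoutedLoRALinear:
    """Frozen base weight plus a shared expert and one expert per reasoning step.

    Assumption: the shared and step-specific updates are simply added together.
    """

    def __init__(self, d_in, d_out, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.02, size=(d_in, d_out))  # stands in for a frozen base layer
        self.shared_expert = LoRAExpert(d_in, d_out, rank, rng)
        self.step_experts = {s: LoRAExpert(d_in, d_out, rank, rng) for s in STEPS}

    def forward(self, x, step_ids):
        """x: (n_tokens, d_in); step_ids: (n_tokens,) integer indices into STEPS."""
        out = x @ self.W + self.shared_expert(x)
        for i, step in enumerate(STEPS):
            mask = step_ids == i  # routing is decided by the step label, not a learned gate
            if mask.any():
                out[mask] += self.step_experts[step](x[mask])
        return out


if __name__ == "__main__":
    layer = StepRoutedLoRALinear(d_in=8, d_out=6, rank=2)
    tokens = np.ones((4, 8))
    step_ids = np.array([0, 1, 2, 3])  # one token per reasoning step
    print(layer.forward(tokens, step_ids).shape)
```

Because each token's expert is fixed by its position in the reasoning chain, skipping a step at inference time (the adaptive inference length mentioned above) would, in this sketch, simply mean that no tokens carry that step's index.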
