Mixture of insighTful Experts (MoTE): The Synergy of Thought Chains and Expert Mixtures in Self-Alignment

Zhili Liu, Yunhao Gou, Kai Chen, Lanqing Hong, Jiahui Gao, Fei Mi, Yu Zhang, Zhenguo Li, Xin Jiang, Qun Liu, James T. Kwok

TL;DR

MoTE tackles the challenge of aligning LLMs with human values by integrating structured, multi-step reasoning with a step-level mixture-of-experts (MoE) architecture. It introduces a four-step reasoning chain (Question Analysis, Answer Guidance, Safe Answer, and Safety Checking) coupled with parallel LoRA experts, one dedicated to each step, and a shared expert that promotes cross-step collaboration. This design enables adaptive inference lengths and improves safety even for smaller models. Experimental results on 7B and 8B models show substantial gains in safety and jailbreak resistance, with performance competitive with OpenAI’s o1 model on key benchmarks, alongside ablations that isolate the contributions of data, step-level routing, and step skipping. The work provides both empirical evidence and a theoretical rationale for decomposing alignment into intermediate steps, and demonstrates a scalable, training-efficient path toward robust self-alignment in LLMs. Overall, MoTE advances practical self-alignment by fusing reasoning-driven safety with parameter-efficient MoE, offering a foundation for extending safer AI across domains and modalities.

Abstract

As the capabilities of large language models (LLMs) continue to expand, aligning these models with human values remains a significant challenge. Recent studies show that reasoning abilities contribute significantly to model safety, while integrating Mixture-of-Experts (MoE) architectures can further enhance alignment. In this work, we address a fundamental question: how can reasoning abilities and MoE architectures be effectively incorporated into the self-alignment process of LLMs? We propose Mixture of insighTful Experts (MoTE), a novel framework that synergistically combines reasoning chains and expert mixtures to improve self-alignment. From a data perspective, MoTE employs a structured reasoning chain comprising four key stages: Question Analysis, Answer Guidance, Safe Answer, and Safety Checking. This approach enhances safety through multi-step reasoning and proves effective even for smaller and less powerful LLMs (e.g., 7B models). From an architectural perspective, MoTE adopts a multi-LoRA framework with step-level routing, where each expert is dedicated to a specific reasoning step. This design eliminates the need for balance losses, ensures stable training, and supports adaptive inference lengths. Experimental results demonstrate that MoTE significantly improves model safety, jailbreak resistance, and over-refusal behavior, achieving performance comparable to OpenAI's state-of-the-art o1 model.
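
As a rough illustration of the step-level routing described above, the following minimal sketch selects a LoRA expert by the reasoning step a token belongs to and adds a shared expert on top. It is not the authors' implementation: the class names `LoRAExpert` and `StepRoutedLayer`, the short step labels, the additive combination of base, step, and shared outputs, and the random initialization are all assumptions made for illustration.

```python
import random

# Short, hypothetical labels for the four stages named in the abstract.
STEPS = ["analysis", "guidance", "answer", "check"]

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class LoRAExpert:
    """Low-rank update B @ A. Both factors are random here so that experts
    differ visibly; this initialization is an assumption for the demo."""
    def __init__(self, d_in, d_out, rank, rng):
        self.A = rand_matrix(rank, d_in, rng)
        self.B = rand_matrix(d_out, rank, rng)

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))

class StepRoutedLayer:
    """Base weight plus one LoRA expert per reasoning step plus a shared expert."""
    def __init__(self, d_in, d_out, rank=2, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(d_out, d_in, rng)  # stands in for a frozen base weight
        self.experts = {s: LoRAExpert(d_in, d_out, rank, rng) for s in STEPS}
        self.shared = LoRAExpert(d_in, d_out, rank, rng)

    def forward(self, x, step):
        # Routing is a dictionary lookup on the step label, not a learned gate,
        # so there is no router to balance with an auxiliary loss.
        base = matvec(self.W, x)
        own = self.experts[step](x)
        shared = self.shared(x)
        return [b + o + s for b, o, s in zip(base, own, shared)]

def route_sequence(layer, tokens, step_labels):
    """Apply the layer token by token, using each token's step label."""
    return [layer.forward(x, s) for x, s in zip(tokens, step_labels)]
```

Because the expert is chosen from a known step label, every expert receives exactly the tokens of its own step, which is one way to read the abstract's claim that balance losses become unnecessary.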

Paper Structure (38 sections, 7 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Overall framework of MoTE. (a) An example of the 4-step reasoning chain, which serves as the training set of MoTE. (b) MoTE employs a multi-LoRA architecture with a shared expert, with each expert focusing on one aspect of the reasoning chain. MoTE both distinguishes each specialist and fosters collaboration among them.
  • Figure 2: Prompt templates for the reasoning chain. We instruct the model step by step: it first analyzes the question, then produces guidance for its output, and finally outputs the answer. Safety checking is performed on the final answer. The final constructed reasoning chain adds special trainable tokens marking the start and end of each reasoning step. The dashed blocks indicate that the middle steps can be skipped depending on the question's difficulty.
  • Figure 3: Efficient step skipping through attention masking.
  • Figure 4: Qualitative comparison between different alignment methods.
  • Figure 5: Comparison of training paradigms. Single Model trains one model on the full reasoning chain. Separate Models tunes three models, one each for analysis, guidance, and answer. MoTE, our proposed method, excels across all metrics.
  • ...and 2 more figures
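
The captions for Figures 2 and 3 mention step delimiters and step skipping through attention masking. The sketch below shows one way those two ideas could fit together. It is only an assumption-laden illustration: the token strings such as `<analysis>`, the whitespace tokenizer, and the exact masking rule (later steps cannot attend to a skipped step, while a skipped step can still attend causally to itself) are hypothetical and are not taken from the paper.

```python
# Short, hypothetical labels for the four reasoning steps.
STEPS = ["analysis", "guidance", "answer", "check"]

def build_chain(question, step_texts):
    """Return (tokens, labels): whitespace tokens and the step each one belongs to.

    Each present step is wrapped in start/end marker tokens, standing in for
    the special trainable tokens mentioned in the Figure 2 caption.
    """
    tokens = question.split()
    labels = ["question"] * len(tokens)
    for step in STEPS:
        if step not in step_texts:
            continue
        piece = [f"<{step}>"] + step_texts[step].split() + [f"</{step}>"]
        tokens += piece
        labels += [step] * len(piece)
    return tokens, labels

def skip_mask(labels, skipped):
    """Build a boolean attention mask; mask[i][j] is True if i may attend to j.

    Assumed rule: attention is causal (j <= i), and tokens of a skipped step
    are hidden from every other step but remain visible within their own step.
    """
    n = len(labels)
    return [
        [
            j <= i and (labels[j] not in skipped or labels[i] == labels[j])
            for j in range(n)
        ]
        for i in range(n)
    ]

tokens, labels = build_chain(
    "What is the capital of France?",
    {
        "analysis": "benign factual question",
        "guidance": "answer directly",
        "answer": "Paris",
    },
)
mask = skip_mask(labels, skipped={"guidance"})
```

Under this rule, the answer tokens are computed as if the guidance step had never been written, so a single training sequence can cover both the full chain and a shortened one.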