OrdMoE: Preference Alignment via Hierarchical Expert Group Ranking in Multimodal Mixture-of-Experts LLMs

Yuting Gao, Weihao Chen, Lan Wang, Ruihan Xu, Qingpei Guo

TL;DR

OrdMoE is a preference alignment framework that removes the reliance on external human preferences by leveraging intrinsic signals within Mixture-of-Experts (MoE) architectures, achieving competitive results without any human-annotated preference data.

Abstract

Preference learning has recently emerged as a pivotal strategy for post-training alignment of Multimodal Large Language Models (MLLMs). However, existing approaches predominantly rely on external human-annotated preference data, which is costly and labor-intensive to collect. In this work, we propose OrdMoE, a novel preference alignment framework that bypasses the reliance on external human preferences entirely by leveraging intrinsic signals within Mixture-of-Experts (MoE) architectures. Specifically, we observe that the router's expert selection scores implicitly encode a quality-aware ranking of responses (i.e., higher-scoring experts consistently generate higher-quality outputs). Building on this insight, OrdMoE constructs an internal preference hierarchy by grouping experts into ranked tiers based on their per-token routing scores and activating each tier separately to produce a sequence of responses with increasing quality. This yields a zero-cost, self-supervised preference ordering over generated responses, which can be directly optimized using standard preference learning objectives. Extensive experiments across multiple multimodal benchmarks demonstrate that OrdMoE significantly enhances both alignment and overall performance of multimodal Mixture-of-Experts LLMs, achieving competitive results without requiring any human-annotated preference data.
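The tier construction described in the abstract can be illustrated with a minimal sketch for a single token. The function names (`softmax`, `group_experts_by_rank`), the equal-sized contiguous tiers, and the example logits are assumptions made for illustration; the abstract does not specify how many experts each tier holds or how ties are handled.

```python
import math


def softmax(logits):
    """Convert router logits into routing probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def group_experts_by_rank(router_logits, num_groups, group_size):
    """Split expert indices into ranked tiers, highest-scoring tier first.

    Assumption: tiers are equal-sized, contiguous slices of the experts
    sorted by routing probability. This is a hypothetical grouping rule.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if num_groups * group_size > len(ranked):
        raise ValueError("not enough experts for the requested tiers")
    return [
        ranked[g * group_size:(g + 1) * group_size]
        for g in range(num_groups)
    ]


# Hypothetical router logits for one token over 8 experts.
router_logits = [2.0, -1.0, 0.5, 3.0, 0.0, 1.5, -0.5, 1.0]
tiers = group_experts_by_rank(router_logits, num_groups=3, group_size=2)
print(tiers)  # [[3, 0], [5, 7], [2, 4]]
```

In this sketch, running the model with only the experts of `tiers[0]` active would give the response treated as most preferred, `tiers[1]` the next, and so on, which is the self-supervised ordering the abstract refers to.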

Paper Structure

This paper contains 34 sections, 12 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Comparison of preference learning paradigms: (a) output-centric methods (e.g., DPO) require human-labeled preferences; (b) input-centric methods (e.g., mDPO) rely on prompt variations; (c) our OrdMoE uses identical inputs and outputs, but exploits the MoE router’s intrinsic signal to construct a self-supervised expert ranking.
  • Figure 2: Overall architecture of our OrdMoE training framework. The router computes routing probabilities and logically groups the selected experts into $\text{Group}_1, \dots, \text{Group}_K, \dots, \text{Group}_C$, corresponding to the highest, intermediate, and lowest probabilities, respectively. The colored experts in each group denote this logical grouping: e.g., $\blacksquare$ (red-shaded) for $\text{Group}_1$, $\blacksquare$ (teal-shaded) for $\text{Group}_K$, and $\blacksquare$ (purple-shaded) for $\text{Group}_C$. Critically, the preference for the token decreases sequentially from $\text{Group}_1$ to $\text{Group}_C$, directly reflecting the decreasing expert routing probabilities inherent to the OrdMoE design.
  • Figure 3: Comparison of results between OrdMoE and the baseline. Our method effectively enhances the general performance of the base model, demonstrating significant improvements in localization precision, fine-grained recognition, and OCR capabilities.
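The ordering in the Figure 2 caption, where preference decreases from $\text{Group}_1$ to $\text{Group}_C$, can be turned into pairwise training signals. The sketch below is one possible reading, not the paper's stated objective: it assumes a DPO-style pairwise loss (the source only says "standard preference learning objectives"), assumes every higher-ranked response is paired with every lower-ranked one, and uses made-up sequence log-probabilities. The names `ranked_pairs`, `dpo_pair_loss`, and `beta` are hypothetical.

```python
import math


def ranked_pairs(num_groups):
    """All (preferred, dispreferred) index pairs; index 0 is Group_1."""
    return [(w, l) for w in range(num_groups) for l in range(w + 1, num_groups)]


def dpo_pair_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style loss -log(sigmoid(margin)) for one preference pair.

    Assumption: DPO stands in for the unspecified preference objective.
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical sequence log-probabilities for responses from 3 groups.
policy_logps = [-10.0, -12.0, -15.0]
ref_logps = [-11.0, -12.0, -14.0]

pairs = ranked_pairs(len(policy_logps))
losses = [
    dpo_pair_loss(policy_logps[w], policy_logps[l], ref_logps[w], ref_logps[l])
    for w, l in pairs
]
mean_loss = sum(losses) / len(losses)
print(pairs)  # [(0, 1), (0, 2), (1, 2)]
```

Pairing only adjacent groups instead of all pairs would be an equally plausible choice; the caption alone does not settle which pairs are used.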