Resolving Task Objective Conflicts in Unified Model via Task-Aware Mixture-of-Experts

Jiaxing Zhang, Hao Tang

TL;DR

This work tackles Task Objective Conflicts (TOC) in unified autoregressive multimodal models that must handle both multimodal understanding (MMU) and text-to-image generation (T2I). It introduces UniDecouple, a Task-Aware MoE (TA-MoE) framework that uses Hierarchical Expert Routing and Hybrid Expert Collaboration to create task-specific optimization subpaths. A two-stage training strategy first specializes the task-specific experts and then jointly fine-tunes with LoRA (rank $r=16$). Across MMU and T2I benchmarks, UniDecouple mitigates negative transfer and catastrophic forgetting, achieving strong MMU performance and T2I results on par with state-of-the-art methods; ablations confirm the contributions of the TA-MoE components and the training scheme. The results suggest that explicitly disentangling task pathways within autoregressive architectures yields robust, scalable models for multimodal reasoning and generation.

Abstract

Unified multimodal large language models (MLLMs) based on end-to-end autoregressive (AR) transformers effectively integrate both understanding and generation tasks within a single framework. However, intrinsic Task Objective Conflicts between high-level semantic abstraction in understanding and fine-grained detail preservation in generation pose significant challenges, often leading to suboptimal trade-offs and task interference. Existing solutions, such as decoupling shared visual encoders, fall short of fundamentally resolving these conflicts because of the inherent AR architecture. In this paper, we propose a novel approach that decouples the internal components of the AR model to resolve task objective conflicts. Specifically, we design UTAMoE, a Unified Task-Aware Mixture-of-Experts (MoE) framework that decouples internal AR modules via a Task-Aware MoE Layer to create task-specific optimization subpaths. To enhance task differentiation while maintaining overall coordination, we introduce a novel Two-Stage Training Strategy. Extensive experiments on multimodal benchmarks demonstrate that UTAMoE mitigates task objective conflicts, achieving state-of-the-art performance across various tasks. Visualizations and ablation studies further validate the effectiveness of our approach.
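
To make the idea of task-specific optimization subpaths concrete, here is a minimal NumPy sketch of a task-aware MoE layer. It is an illustration under stated assumptions, not the authors' implementation: the class name `TaskAwareMoE`, the use of single linear maps as experts, the expert counts, the top-k value, and the averaging of shared experts are all hypothetical choices. The only ideas taken from the text are that a task-level router selects a task-specific expert group, a second router assigns tokens to experts within that group, and shared experts are combined with the task-specific ones.

```python
import numpy as np


def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class TaskAwareMoE:
    """Toy task-aware MoE layer (hypothetical sketch, not the paper's code)."""

    def __init__(self, d_model, n_task_experts=4, n_shared=1, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        tasks = ("und", "gen")  # understanding / generation
        # One expert group per task; each expert is a single linear map here.
        self.experts = {
            t: rng.normal(0, 0.02, (n_task_experts, d_model, d_model)) for t in tasks
        }
        # Per-task gate used by the second-level (dynamic-assignment) router.
        self.gates = {t: rng.normal(0, 0.02, (d_model, n_task_experts)) for t in tasks}
        # Shared experts see every token regardless of task.
        self.shared = rng.normal(0, 0.02, (n_shared, d_model, d_model))

    def route(self, x, task):
        """Level 1: pick the task's expert group. Level 2: top-k gating per token."""
        logits = x @ self.gates[task]                      # (tokens, n_task_experts)
        idx = np.argsort(-logits, axis=-1)[:, : self.top_k]
        top = np.take_along_axis(logits, idx, axis=-1)
        return idx, softmax(top, axis=-1)                  # weights sum to 1 per token

    def __call__(self, x, task):
        idx, w = self.route(x, task)
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            for j in range(self.top_k):
                out[t] += w[t, j] * (x[t] @ self.experts[task][idx[t, j]])
        # Hybrid collaboration: add the mean output of the shared experts.
        out += np.einsum("td,sde->te", x, self.shared) / self.shared.shape[0]
        return out


layer = TaskAwareMoE(d_model=8)
x = np.random.default_rng(1).normal(size=(3, 8))
y_und = layer(x, "und")  # tokens routed through understanding experts
y_gen = layer(x, "gen")  # same tokens routed through generation experts
```

Because the two tasks never share task-specific expert weights in this sketch, gradients from an understanding loss and a generation loss would only meet in the shared experts, which is the intuition behind decoupling the two objectives.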

Paper Structure

This paper contains 26 sections, 15 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Comparison of different multimodal large language model (MLLM) architectures. (a) Single Image Encoder: pixel and semantic features are jointly encoded into a unified representation (e.g., Chameleon, EMU3, VILA-U, LWM, TokenFlow). (b) Decoupled Image Encoder: pixel-level and semantic-level features are separately processed and then integrated into the MLLM (e.g., Janus, JanusPro). (c) Decoupled AR (Ours): UniDecouple introduces task-specific experts via a Mixture-of-Experts (MoE) design to resolve conflicts between semantic-level understanding and pixel-level generation.
  • Figure 2: (a) Overview of UniDecouple. "Und." and "Gen." denote understanding and generation. (b) Task-Aware Mixture-of-Experts Layer. Hierarchical Expert Routing consists of a task-aware router and a dynamic-assignment (DA) router, while Hybrid Expert Collaboration refers to the integration of task-specific experts and shared experts.
  • Figure 3: Two-Stage Training Strategy.
  • Figure 4: Loss dynamics during multi-task training. The left plot shows loss curves for understanding (cross-entropy) and generation (MSE) tasks. The right plot highlights their trade-off patterns.
  • Figure 5: Visualization of the average expert load distribution based on a random sample of 100 instances from MMU.
  • ...and 1 more figure
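
The TL;DR states that the second training stage jointly fine-tunes with LoRA at rank $r=16$. As a rough illustration of what that rank implies, the sketch below implements a standard LoRA linear layer in NumPy: a frozen weight `W` plus a trainable low-rank update `(alpha / r) * B @ A`. Only the rank comes from the text. The class name `LoRALinear`, the 1024-dimensional layer size, the value `alpha=16.0`, and the initialization scales are assumptions chosen for the example, and nothing here indicates which modules the authors actually adapt.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, d_in, d_out, r=16, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 0.02, (d_out, d_in))  # frozen base weight
        self.A = rng.normal(0, 0.01, (r, d_in))      # trainable, small random init
        self.B = np.zeros((d_out, r))                # trainable, zero init
        self.scale = alpha / r                       # assumed scaling convention

    def __call__(self, x):
        # Base projection plus the scaled low-rank correction.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size


lora = LoRALinear(d_in=1024, d_out=1024, r=16)
x = np.ones((2, 1024))
y = lora(x)
ratio = lora.trainable_params() / lora.W.size  # fraction of the full weight count
```

With these assumed sizes, the adapter holds 2 x 16 x 1024 = 32,768 trainable parameters against 1,048,576 in the frozen weight, about 3% of the layer. Because `B` starts at zero, the layer initially reproduces the frozen model's output, so fine-tuning starts from the behavior learned in the first stage.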