Resolving Task Objective Conflicts in Unified Model via Task-Aware Mixture-of-Experts
Jiaxing Zhang, Hao Tang
TL;DR
This work tackles Task Objective Conflicts (TOC) in unified autoregressive multimodal models that must perform both multimodal understanding (MMU) and text-to-image generation (T2I). It introduces UTAMoE, a Task-Aware MoE (TA-MoE) framework that uses Hierarchical Expert Routing and Hybrid Expert Collaboration to create task-specific optimization subpaths, together with a two-stage training strategy that first specializes task-specific experts and then jointly fine-tunes with LoRA (rank $r=16$). Across MMU and T2I benchmarks, UTAMoE mitigates negative transfer and catastrophic forgetting, achieving strong MMU performance and T2I results on par with state-of-the-art methods; ablations confirm the contributions of the TA-MoE components and the training scheme. The results suggest that explicitly disentangling task pathways inside an autoregressive architecture is an effective way to build unified models for multimodal reasoning and generation.
Abstract
Unified multimodal large language models (MLLMs) based on end-to-end autoregressive (AR) transformers effectively integrate both understanding and generation tasks within a single framework. However, intrinsic Task Objective Conflicts between high-level semantic abstraction in understanding and fine-grained detail preservation in generation pose significant challenges, often leading to suboptimal trade-offs and task interference. Existing solutions, such as decoupling shared visual encoders, fall short of fundamentally resolving these conflicts because of the inherent AR architecture. In this paper, we propose a novel approach that decouples the internal components of the AR architecture to resolve task objective conflicts. Specifically, we design UTAMoE, a Unified Task-Aware Mixture-of-Experts (MoE) framework that decouples internal AR modules via a Task-Aware MoE Layer to create task-specific optimization subpaths. To enhance task differentiation while maintaining overall coordination, we introduce a novel Two-Stage Training Strategy. Extensive experiments on multimodal benchmarks demonstrate that UTAMoE mitigates task objective conflicts, achieving state-of-the-art performance across various tasks. Visualizations and ablation studies further validate the effectiveness of our approach.
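To make the idea of task-specific optimization subpaths concrete, the following is a minimal NumPy sketch of a task-aware MoE feed-forward layer. It is not the UTAMoE implementation: the class name `TaskAwareMoE`, the task labels `"mmu"` and `"t2i"`, the hard selection of an expert pool by task label, the softmax gate, and all dimensions are assumptions chosen for illustration only.

```python
import numpy as np


class TaskAwareMoE:
    """Toy task-aware MoE feed-forward layer (hypothetical, for illustration)."""

    def __init__(self, dim, hidden, experts_per_task=2,
                 tasks=("mmu", "t2i"), seed=0):
        rng = np.random.default_rng(seed)
        self.tasks = tasks
        # Assumption: each task owns a disjoint expert pool and its own router,
        # so gradients for one task would never touch the other task's experts.
        self.w_in = {t: rng.normal(0.0, 0.1, (experts_per_task, dim, hidden))
                     for t in tasks}
        self.w_out = {t: rng.normal(0.0, 0.1, (experts_per_task, hidden, dim))
                      for t in tasks}
        self.router = {t: rng.normal(0.0, 0.1, (dim, experts_per_task))
                       for t in tasks}

    def forward(self, x, task):
        """x has shape (tokens, dim); task selects the expert pool."""
        # Softmax gate over the experts that belong to this task.
        logits = x @ self.router[task]                    # (tokens, experts)
        logits = logits - logits.max(axis=-1, keepdims=True)
        gates = np.exp(logits)
        gates = gates / gates.sum(axis=-1, keepdims=True)

        # Run every expert in the pool: linear -> ReLU -> linear.
        h = np.einsum("td,edh->teh", x, self.w_in[task])
        h = np.maximum(h, 0.0)
        y = np.einsum("teh,ehd->ted", h, self.w_out[task])

        # Gate-weighted sum of expert outputs, back to (tokens, dim).
        return np.einsum("te,ted->td", gates, y)


layer = TaskAwareMoE(dim=8, hidden=16)
tokens = np.ones((3, 8))
out_understanding = layer.forward(tokens, "mmu")
out_generation = layer.forward(tokens, "t2i")
```

The same input tokens pass through different parameters depending on the task label, which is the sense in which the two objectives no longer compete for the same feed-forward weights. A full model would presumably keep attention and embeddings shared, and the sketch says nothing about how the paper's routing or two-stage training actually works.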
