Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model
Longrong Yang, Dong Shen, Chaoxiang Cai, Fan Yang, Tingting Gao, Di Zhang, Xi Li
TL;DR
This work tackles gradient interference among tokens routed to the same MoE expert in large vision-language models. It introduces STGC, which identifies conflicting tokens via token-level gradients and mitigates interference with a conflict-elimination loss that re-routes conflicting tokens to other experts, all while preserving a balanced load. The approach is a versatile plug-in for MoE-based LVLMs and yields consistent gains on image question answering, visual reasoning benchmarks, and even language tasks, with manageable training overhead. By demonstrating improved gradient consistency and reduced token conflicts, STGC advances efficient MoE utilization and expert specialization in diverse, data-rich LVLM settings.
Abstract
Mixture-of-Experts (MoE) has gained increasing attention in the study of Large Vision-Language Models (LVLMs). It replaces a dense model with a sparse one, achieving comparable performance while activating fewer parameters during inference, which significantly reduces inference cost. Existing MoE methods for LVLMs encourage different experts to specialize in different tokens, and they usually employ a router to predict the routing of each token. However, the router is not optimized with respect to the distinct parameter optimization directions produced by the tokens within an expert. This may lead to severe interference between tokens within an expert. To address this problem, we propose Solving Token Gradient Conflict (STGC), which relies on token-level gradient analysis. Specifically, we first use token-level gradients to identify conflicting tokens in experts. We then add a tailored regularization loss that encourages conflicting tokens to be routed from their current experts to other experts, reducing interference between tokens within an expert. Our method can serve as a plug-in for diverse LVLM methods, and extensive experimental results demonstrate its effectiveness. The code will be publicly available at https://github.com/longrongyang/STGC.
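
The two steps described in the abstract (identify conflicting tokens from token-level gradients, then penalize their current routing) can be illustrated with a minimal sketch. This is not the authors' implementation: the conflict criterion (negative cosine similarity between a token's gradient and the mean gradient of the tokens in the same expert), the penalty form -log(1 - p), and the names `find_conflicting_tokens` and `conflict_loss` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_conflicting_tokens(token_grads, threshold=0.0):
    """Return indices of tokens flagged as conflicting within one expert.

    ASSUMPTION: a token conflicts when the cosine similarity between its
    gradient and the mean gradient of all tokens in the expert is below
    `threshold`. The paper's exact criterion may differ.
    """
    n = len(token_grads)
    dim = len(token_grads[0])
    mean_grad = [sum(g[d] for g in token_grads) / n for d in range(dim)]
    return [i for i, g in enumerate(token_grads)
            if cosine(g, mean_grad) < threshold]


def conflict_loss(router_probs, current_expert, conflicting):
    """Hypothetical regularizer pushing conflicting tokens off their expert.

    ASSUMPTION: the penalty is the mean of -log(1 - p) over conflicting
    tokens, where p is the router probability of the current expert.
    Lowering p for those tokens lowers the loss.
    """
    if not conflicting:
        return 0.0
    eps = 1e-12  # guards against log(0) when p is exactly 1
    total = 0.0
    for i in conflicting:
        p = router_probs[i][current_expert]
        total += -math.log(max(1.0 - p, eps))
    return total / len(conflicting)
```

In a real training loop, a term like this would be added to the task loss with a weighting coefficient and combined with the usual load-balancing loss; the per-token gradients would come from the expert's parameters rather than from hand-written lists.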
