Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model

Longrong Yang; Dong Shen; Chaoxiang Cai; Fan Yang; Tingting Gao; Di Zhang; Xi Li

TL;DR

This work tackles gradient interference among tokens routed to the same MoE expert in large vision-language models. It introduces STGC, which identifies conflicting tokens via token-level gradients and mitigates interference with a conflict-elimination loss that re-routes conflicting tokens to other experts, all while preserving a balanced load. The approach is a versatile plug-in for MoE-based LVLMs and yields consistent gains on image question answering, visual reasoning benchmarks, and even language tasks, with manageable training overhead. By demonstrating improved gradient consistency and reduced token conflicts, STGC advances efficient MoE utilization and expert specialization in diverse, data-rich LVLM settings.

Abstract

Mixture-of-Experts (MoE) has gained increasing attention in the study of Large Vision-Language Models (LVLMs). It replaces the dense model with a sparse one, achieving comparable performance while activating fewer parameters during inference and thus significantly reducing inference cost. Existing MoE methods for LVLMs encourage different experts to specialize in different tokens, and they usually employ a router to predict the routing of each token. However, the router is not optimized with respect to the distinct parameter optimization directions generated by the tokens within an expert, which may lead to severe interference between those tokens. To address this problem, we propose Solving Token Gradient Conflict (STGC), which is based on token-level gradient analysis. Specifically, we first use token-level gradients to identify conflicting tokens in experts. We then add a tailored regularization loss that encourages conflicting tokens to be routed from their current experts to other experts, reducing interference between tokens within an expert. Our method can serve as a plug-in for diverse LVLM methods, and extensive experimental results demonstrate its effectiveness. The code will be publicly available at https://github.com/longrongyang/STGC.
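The two steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `threshold` cutoff, and the specific penalty `-log(1 - p_current)` are assumptions chosen only to show the idea of flagging tokens whose gradient disagrees with their expert's mean gradient and then penalizing the router for keeping them there.

```python
import numpy as np


def find_conflicting_tokens(token_grads, threshold=0.0):
    """Flag tokens whose gradient disagrees with the expert's mean gradient.

    token_grads: shape (num_tokens, num_params), one flattened gradient per
    token routed to a single expert.
    threshold: hypothetical cosine-similarity cutoff (an assumption).
    """
    grads = np.asarray(token_grads, dtype=np.float64)
    mean_grad = grads.mean(axis=0)
    eps = 1e-12  # guards against division by zero for all-zero gradients
    cos = grads @ mean_grad / (
        np.linalg.norm(grads, axis=1) * np.linalg.norm(mean_grad) + eps
    )
    return cos < threshold, cos


def conflict_elimination_loss(router_logits, current_expert, conflicting):
    """Illustrative penalty (an assumption, not the paper's loss):
    mean of -log(1 - p_current) over the conflicting tokens."""
    logits = np.asarray(router_logits, dtype=np.float64)
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=1, keepdims=True)
    # Router probability that each token assigns to its current expert.
    p_current = probs[np.arange(len(probs)), current_expert]
    if not conflicting.any():
        return 0.0
    return float(-np.log(1.0 - p_current[conflicting] + 1e-12).mean())


# Three tokens in one expert; the third points against the other two.
mask, cos = find_conflicting_tokens([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2]])
loss = conflict_elimination_loss(
    [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0]], [0, 0, 0], mask
)
```

Under this sketch the penalty shrinks as the router moves probability mass for a conflicting token away from its current expert, which is the behavior the abstract describes. A real implementation would need the loss to be differentiable with respect to the router parameters, for example by writing it in an autodiff framework.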

Paper Structure (29 sections, 17 equations, 9 figures, 14 tables)

Introduction
Related Works
Large Vision-language Model
Mixture-of-Experts (MoE)
Methodology
Overview
Conflicting Token Identification
Conflict Elimination Loss
Total Loss
Experiments
Experimental Setup
Image Understanding Evaluation
Ablation Study
Conclusion and Limitations
Acknowledgements
...and 14 more sections

Figures (9)

Figure 1: (a) In this work, we aim to solve data interference by adjusting token routing to reduce gradient conflicts. (b) We present statistics on gradient consistency (the mean cosine similarity between gradients of all tokens within an expert). In the experiments, we feed one sample per device into the LVLM in each forward pass. The baseline LVLM is MoE-LLaVA (Lin et al., 2024).
Figure 2: Our pipeline. (a) Conflicting Token Identification. When the gradient of a token has a sufficiently low cosine similarity to the average gradient of its assigned expert, this token is marked as a conflicting token (an outlier for the expert). (b) Conflict Elimination Loss. We propose a loss aimed at encouraging the routing of conflicting tokens from their current experts to other experts.
Figure 3: Statistical verification. We analyze the role of STGC in depth. "Baseline" indicates MoE-LLaVA. "Baseline + STGC" indicates our method. (a) We compute a novel metric, gradient consistency (the mean cosine similarity between gradients of all tokens within an expert), to verify that a decrease in the proposed loss leads to more consistent token gradients within an expert. (b) We further analyze the gradient consistency on different layers.
Figure 4: Load balance loss. "Baseline" indicates MoE-LLaVA. "Baseline + STGC" indicates our method. We present the load balancing loss curve before and after adding STGC. The results are obtained from regular training, where one epoch comprises 5194 training steps. A lower load balancing loss means a more balanced expert load.
Figure 5: Expert loading and activated pathways. The experiments use the MoE-LLaVA-4Top2 configuration with StableLM-1.6B. We select three validation datasets, i.e., SQA (Lu et al., 2022), TextVQA (Singh et al., 2019), and MMBench (Liu et al., 2023), to analyze expert loading and activated pathways. In the activated pathways, the colored paths represent the top-2 paths for text and image, respectively, while the gray paths represent the remaining 8 paths.
...and 4 more figures
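The two quantities tracked in the captions above, gradient consistency (Figures 1 and 3) and the load balancing loss (Figure 4), can be sketched as follows. Both functions are assumptions for illustration: the captions do not say whether the mean cosine similarity is taken over token pairs (as done here) or against a mean gradient, and the load balancing loss is written in the common Switch-Transformer-style form with top-1 assignments, which may differ from what the baseline actually uses.

```python
import numpy as np


def gradient_consistency(token_grads):
    """Mean pairwise cosine similarity between token gradients of one expert.

    token_grads: shape (num_tokens, num_params) with num_tokens >= 2.
    """
    g = np.asarray(token_grads, dtype=np.float64)
    if len(g) < 2:
        raise ValueError("need at least two token gradients")
    g = g / (np.linalg.norm(g, axis=1, keepdims=True) + 1e-12)
    sim = g @ g.T  # cosine similarity between every pair of tokens
    n = len(g)
    # Average over distinct pairs only, excluding self-similarities.
    return float((sim.sum() - np.trace(sim)) / (n * (n - 1)))


def load_balance_loss(router_probs, expert_index, num_experts):
    """Switch-style auxiliary loss: num_experts * sum_i f_i * P_i.

    f_i: fraction of tokens dispatched to expert i (top-1, an assumption).
    P_i: mean router probability assigned to expert i.
    """
    probs = np.asarray(router_probs, dtype=np.float64)
    idx = np.asarray(expert_index)
    f = np.bincount(idx, minlength=num_experts) / len(idx)
    p = probs.mean(axis=0)
    return float(num_experts * np.sum(f * p))


aligned = gradient_consistency([[1.0, 0.0], [1.0, 0.0]])
opposed = gradient_consistency([[1.0, 0.0], [-1.0, 0.0]])
balanced = load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], 2)
skewed = load_balance_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], 2)
```

In this form, identical gradients give a consistency near 1 and opposite gradients near -1, while the load balancing loss is smallest when tokens and router probability are spread evenly across experts, matching the reading of Figure 4 that a lower value means a more balanced load.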

Theorems & Definitions (1)

Definition 1: Conflicting Token
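Only the title of Definition 1 appears here. Based on the wording of the Figure 2 caption, one plausible formalization is sketched below. The symbols are assumptions introduced for illustration: $\mathcal{T}_e$ is the set of tokens routed to expert $e$, $g_t$ is the gradient of token $t$ with respect to that expert's parameters, $\bar{g}_e$ is their average, and $\tau$ is a hypothetical similarity threshold. The paper's own statement may differ.

```latex
\bar{g}_e = \frac{1}{|\mathcal{T}_e|} \sum_{t \in \mathcal{T}_e} g_t,
\qquad
t \in \mathcal{T}_e \text{ is a conflicting token}
\iff
\cos(g_t, \bar{g}_e)
= \frac{g_t^{\top} \bar{g}_e}{\lVert g_t \rVert \, \lVert \bar{g}_e \rVert}
< \tau
```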
