$γ$-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models

Yaxin Luo, Gen Luo, Jiayi Ji, Yiyi Zhou, Xiaoshuai Sun, Zhiqiang Shen, Rongrong Ji

TL;DR

This work tackles the efficiency bottleneck of multimodal large language models (MLLMs) by introducing γ-MoD, a plug-in mixture-of-depth (MoD) adaptation guided by a novel redundancy metric, ARank. ARank identifies layers where token-level attention is redundant enough to justify replacing dense computation with MoD layers, while a shared vision-language router and masked routing learning maximize sparsity while largely preserving performance. Empirical results on 9 vision-language benchmarks show substantial reductions in training and inference costs (e.g., up to ~50% fewer FLOPs and ~50% less inference time) with only modest accuracy declines (around 1–2%), and strong generalization across MLLM architectures and scales. The approach demonstrates that converting a large portion of dense layers to MoD layers is feasible and beneficial for practical deployment of MLLMs.

Abstract

Despite the significant progress in multimodal large language models (MLLMs), their high computational cost remains a barrier to real-world deployment. Inspired by the mixture of depths (MoDs) in natural language processing, we aim to address this limitation from the perspective of "activated tokens". Our key insight is that if most tokens are redundant for the layer computation, then they can be skipped directly via the MoD layer. However, directly converting the dense layers of MLLMs to MoD layers leads to substantial performance degradation. To address this issue, we propose an innovative MoD adaptation strategy for existing MLLMs called $γ$-MoD. In $γ$-MoD, a novel metric is proposed to guide the deployment of MoDs in the MLLM, namely the rank of attention maps (ARank). Through ARank, we can effectively identify which layers are redundant and should be replaced with MoD layers. Based on ARank, we further propose two novel designs to maximize the computational sparsity of the MLLM while maintaining its performance, namely a shared vision-language router and masked routing learning. With these designs, more than 90% of the dense layers of the MLLM can be effectively converted to MoD layers. To validate our method, we apply it to three popular MLLMs and conduct extensive experiments on 9 benchmark datasets. Experimental results not only validate the significant efficiency benefit of $γ$-MoD for existing MLLMs but also confirm its generalization ability across various MLLMs. For example, with a minor performance drop, i.e., -1.5%, $γ$-MoD can reduce the training and inference time of LLaVA-HR by 31.0% and 53.2%, respectively.
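To make the idea of a rank-based redundancy estimate concrete, here is a minimal sketch in pure Python. It assumes that a layer's ARank can be approximated as the mean numerical rank of its per-head attention maps, and that layers scoring below a chosen threshold are treated as candidates for MoD conversion. The abstract does not give the exact formula, so the averaging scheme, the tolerance, the threshold, and the function names (`matrix_rank`, `arank`, `redundant_layers`) are all illustrative assumptions rather than the authors' implementation.

```python
def matrix_rank(matrix, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    rows = [[float(v) for v in row] for row in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Choose the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # No usable pivot: this column adds no new direction.
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        pivot_row = rows[rank]
        # Eliminate this column from every row below the pivot.
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / pivot_row[col]
            if factor != 0.0:
                for c in range(col, n_cols):
                    rows[r][c] -= factor * pivot_row[c]
        rank += 1
    return rank


def arank(head_attention_maps):
    """Assumed ARank: mean rank of one layer's attention maps across heads."""
    ranks = [matrix_rank(a) for a in head_attention_maps]
    return sum(ranks) / len(ranks)


def redundant_layers(layer_aranks, threshold):
    """Hypothetical selection rule: indices of layers whose ARank is below threshold."""
    return [i for i, value in enumerate(layer_aranks) if value < threshold]


# A uniform attention map (every token attends equally) has rank 1,
# while an identity-like map (every token attends only to itself) has full rank.
uniform = [[0.25] * 4 for _ in range(4)]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

print(matrix_rank(uniform))                       # 1
print(matrix_rank(identity))                      # 4
print(arank([uniform, identity]))                 # 2.5
print(redundant_layers([4.0, 1.5, 3.0], 2.0))     # [1]
```

In practice the attention maps would come from forward passes over a sample of inputs, and a library routine for matrix rank would replace the hand-written elimination; the pure-Python version is used here only to keep the sketch self-contained.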

Paper Structure

This paper contains 18 sections, 8 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Visualization of attention maps in the MLLM and comparison of MoE with MoD. (a) Lower-rank layers often exhibit redundancy in their attention computation. (b) Different from MoE, MoD achieves the computational sparsity from the perspective of "activated token", where the computational budget is dynamically allocated to each token.
  • Figure 2: Illustration of our $\gamma$-MoD adaptation on LLaVA-HR. $\gamma$-MoD is a plug-and-play approach that can be directly applied in existing MLLMs. After vision-language alignment, $\gamma$-MoD can replace most redundant layers with MoD ones via the rank-based redundancy estimation.
  • Figure 3: Visualization of ARank based on different tasks (left) and sample sizes (right). The horizontal axis represents the layer index of LLaVA-HR. A darker color indicates a larger ARank.
  • Figure 4: Visualization of routing results for different MoD layers. "Q", "I" and "A" denote the question, image and response, respectively. The skipped tokens in sub-figure (b) are colored in gray.
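
The token-level sparsity described in the captions of Figures 1(b) and 4 can be illustrated with a small routing sketch. It assumes a single linear router that scores every token regardless of modality (a simplified reading of the "shared vision-language router"), keeps the top-scoring fraction of tokens, and lets the remaining tokens bypass the layer unchanged through the residual path. The function name `mod_layer`, the `capacity` parameter, and the toy `layer_fn` are hypothetical; masked routing learning and the weighting of layer outputs by router scores are deliberately left out.

```python
def mod_layer(tokens, router_weights, layer_fn, capacity):
    """Route a fraction of tokens through layer_fn; the rest skip the layer.

    tokens:         list of feature vectors (lists of floats)
    router_weights: one weight per feature, shared by all tokens
    layer_fn:       maps the list of kept tokens to a list of updates
    capacity:       fraction of tokens that the layer processes
    """
    # One shared router scores every token with a dot product.
    scores = [sum(w * x for w, x in zip(router_weights, tok)) for tok in tokens]
    k = max(1, int(capacity * len(tokens)))
    by_score = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(by_score[:k])

    # Only the kept tokens pay for the layer computation.
    updates = layer_fn([tokens[i] for i in kept])
    update_for = dict(zip(kept, updates))

    outputs = []
    for i, tok in enumerate(tokens):
        if i in update_for:
            # Residual connection around the layer for activated tokens.
            outputs.append([x + u for x, u in zip(tok, update_for[i])])
        else:
            # Skipped tokens pass through unchanged.
            outputs.append(list(tok))
    return outputs, kept


# Toy stand-in for a transformer block: adds 1.0 to every feature.
def toy_layer(kept_tokens):
    return [[1.0 for _ in tok] for tok in kept_tokens]


tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.5]]
outputs, kept = mod_layer(tokens, [1.0, 0.0], toy_layer, capacity=0.5)
print(kept)     # [0, 2]
print(outputs)  # [[2.0, 1.0], [0.0, 1.0], [3.0, 1.0], [0.0, 0.5]]
```

With `capacity=0.5`, half of the tokens are processed and the other half are returned untouched, which is the sense in which the computational budget is allocated per token rather than per expert.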