Dynamic Pyramid Network for Efficient Multimodal Large Language Model

Hao Ai, Kunyi Wang, Zezhou Wang, Hao Lu, Jin Tian, Yaxin Luo, Peng Xing, Jen-Yuan Huang, Huaxia Li, Gen Luo

TL;DR

Dynamic Pyramid Network (DPN) tackles the high computational cost of multimodal LLMs by embedding a hierarchical, pyramid-like visual compression inside the LLM. It introduces Dynamic Pooling Experts (DPE) that select a pooling rate per input via a routing mechanism guided by a routing loss, enabling sample-aware acceleration. Empirical results show up to a 56% FLOPs reduction on LLaVA with a small performance gain of +0.74%, and a 1.4x speedup with a +0.62% gain on LLaVA-HR-X, with generalization to LLaVA-HR and ablations supporting the effectiveness of dynamic, integrated pooling. The approach reuses existing pre-trained visual projectors, avoids multi-stage training, and offers a scalable path to efficient vision-language inference in MLLMs.

Abstract

Multimodal large language models (MLLMs) have demonstrated impressive performance in various vision-language (VL) tasks, but their expensive computation still limits their real-world application. To address this issue, recent efforts aim to compress the visual features to reduce the computational cost of MLLMs. However, direct visual compression methods, e.g., efficient projectors, inevitably destroy the visual semantics in MLLMs, especially for difficult samples. To overcome this shortcoming, we propose a novel dynamic pyramid network (DPN) for efficient MLLMs. Specifically, DPN formulates an MLLM as a hierarchical structure in which visual features are gradually compressed with increasing depth. In this case, even with a high compression ratio, fine-grained visual information can still be perceived in shallow layers. To maximize the benefit of DPN, we further propose Dynamic Pooling Experts (DPE), which dynamically choose the optimal visual compression rate according to the input features. With this design, harder samples are assigned more computation, thus preserving model performance. To validate our approach, we conduct extensive experiments on two popular MLLMs and ten benchmarks. Experimental results show that DPN can save up to 56% average FLOPs on LLaVA while also achieving a +0.74% performance gain. In addition, the generalization ability of DPN is validated on the existing high-resolution MLLM LLaVA-HR. The source code will be released at https://github.com/aihao2000/DPN-LLaVA.
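To make the idea of input-dependent pooling more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear router applied to the mean visual token, a hard argmax choice among candidate kernels, and plain average pooling. The kernel set `KERNELS`, the function names, and the router weights are all hypothetical, and the routing loss mentioned in the TL;DR is omitted because its form is not given here.

```python
import math

# Hypothetical candidate pooling kernels; the actual set is not given in this summary.
KERNELS = (1, 2, 4)


def avg_pool(grid, k):
    """Average-pool an H x W grid of feature vectors with a k x k kernel and stride k."""
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    assert h % k == 0 and w % k == 0, "grid size must be divisible by the kernel"
    pooled = []
    for i in range(0, h, k):
        row = []
        for j in range(0, w, k):
            cell = [0.0] * d
            for di in range(k):
                for dj in range(k):
                    for c in range(d):
                        cell[c] += grid[i + di][j + dj][c]
            row.append([v / (k * k) for v in cell])
        pooled.append(row)
    return pooled


def route(grid, router_weights):
    """Score each expert with a linear router on the mean token (an assumption).

    Returns the index of the highest-probability expert and the softmax probabilities.
    """
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    mean = [
        sum(grid[i][j][c] for i in range(h) for j in range(w)) / (h * w)
        for c in range(d)
    ]
    logits = [sum(m * wt for m, wt in zip(mean, wvec)) for wvec in router_weights]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


def dynamic_pool(grid, router_weights):
    """Pick one pooling kernel per input and apply it to the visual token grid."""
    idx, probs = route(grid, router_weights)
    kernel = KERNELS[idx]
    return avg_pool(grid, kernel), kernel, probs


# Toy input: a 4 x 4 grid of 2-dimensional visual tokens.
grid = [[[1.0, 0.0] for _ in range(4)] for _ in range(4)]
# One hypothetical weight vector per expert (kernel 1, 2, 4).
router_weights = [[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
pooled, kernel, probs = dynamic_pool(grid, router_weights)
print(kernel, len(pooled) * len(pooled[0]))  # selected kernel and remaining token count
```

In this toy setup a larger kernel leaves fewer visual tokens for later layers, so an input routed to a small kernel receives more computation. Applying such a step at several depths would give the pyramid-like compression the abstract describes.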

Paper Structure

This paper contains 20 sections, 8 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Illustration of existing plain MLLMs and our dynamic pyramid network (DPN). Compared to plain MLLMs, our DPN can dynamically compress the input tokens in a hierarchical manner while largely preserving model performance.
  • Figure 2: Illustration of Dynamic Pyramid Network (DPN) and its Dynamic Pooling Experts (DPE). DPN formulates the common LLM as a dynamic pyramid structure, and the visual tokens will be progressively pooled via the DPE. In practice, DPE can dynamically select an optimal pooling kernel for visual compression, thus achieving the best trade-off between efficiency and performance.
  • Figure 3: Statistics of expert activation for different datasets. Our DPN can dynamically select the pooling kernel according to the task difficulty.
  • Figure 4: Visualization Results. Sub-figure A illustrates the routing of the DPN method across different visual tasks. Sub-figure B presents a visual comparison with the sparsification method FastV and the efficient projector-based approach TokenPacker. Our method demonstrates significant advantages in tasks such as optical character recognition (OCR), chart interpretation, and object detection.