Dynamic Pyramid Network for Efficient Multimodal Large Language Model

Hao Ai; Kunyi Wang; Zezhou Wang; Hao Lu; Jin Tian; Yaxin Luo; Peng Xing; Jen-Yuan Huang; Huaxia Li; Gen Luo

TL;DR

Dynamic Pyramid Network (DPN) tackles the high computational cost of multimodal LLMs by embedding hierarchical, pyramid-like visual compression inside the LLM itself. It introduces Dynamic Pooling Experts (DPE), which select a pooling rate per input through a routing mechanism trained with a routing loss, so that acceleration adapts to each sample. Experiments show up to a 56% FLOPs reduction on LLaVA together with a +0.74% performance gain, and a 1.4x speedup with a +0.62% gain on LLaVA-HR-X. The method also generalizes to LLaVA-HR, and ablations support the benefit of dynamic pooling integrated into the LLM. The approach reuses existing pre-trained visual projectors and avoids multi-stage training, offering a scalable path to efficient vision-language inference in MLLMs.
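The routing idea behind DPE can be illustrated with a minimal sketch. It is not the authors' implementation: the candidate rates (1, 2, 4), the linear router over the mean token feature, the argmax selection, and the 1D average pooling are all assumptions made here for illustration, and the routing loss used in training is not modeled.

```python
import math
import random


def avg_pool_tokens(tokens, rate):
    """Average every `rate` consecutive tokens (1D pooling, an assumption)."""
    pooled = []
    for start in range(0, len(tokens), rate):
        group = tokens[start:start + rate]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(tokens, router_weights):
    """Score each candidate pooling rate from the mean token feature.

    `router_weights` is a hypothetical linear router: one weight row per rate.
    """
    dim = len(tokens[0])
    mean_feat = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    logits = [sum(w * x for w, x in zip(row, mean_feat)) for row in router_weights]
    return softmax(logits)


def dynamic_pool(tokens, router_weights, rates=(1, 2, 4)):
    """Pick the highest-scoring rate for this input and pool with it."""
    probs = route(tokens, router_weights)
    choice = max(range(len(rates)), key=lambda i: probs[i])
    return avg_pool_tokens(tokens, rates[choice]), rates[choice], probs


if __name__ == "__main__":
    random.seed(0)
    visual_tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(16)]
    weights = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
    pooled, rate, probs = dynamic_pool(visual_tokens, weights)
    print(f"selected rate {rate}: {len(visual_tokens)} -> {len(pooled)} tokens")
```

Under this sketch, an input routed to rate 1 keeps all of its visual tokens, which corresponds to the idea that harder samples receive more computation.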

Abstract

Multimodal large language models (MLLMs) have demonstrated impressive performance in various vision-language (VL) tasks, but their high computational cost still limits real-world application. To address this issue, recent efforts compress the visual features to reduce the computational cost of MLLMs. However, direct visual compression methods, e.g., efficient projectors, inevitably destroy the visual semantics in MLLMs, especially for difficult samples. To overcome this shortcoming, we propose a novel dynamic pyramid network (DPN) for efficient MLLMs. Specifically, DPN formulates the MLLM as a hierarchical structure in which visual features are gradually compressed with increasing depth. As a result, even with a high compression ratio, fine-grained visual information can still be perceived in the shallow layers. To maximize the benefit of DPN, we further propose Dynamic Pooling Experts (DPE), a module that dynamically chooses the optimal visual compression rate according to the input features. With this design, harder samples are assigned more computation, thus preserving model performance. To validate our approach, we conduct extensive experiments on two popular MLLMs and ten benchmarks. Experimental results show that DPN can save up to 56% average FLOPs on LLaVA while also achieving a +0.74% performance gain. In addition, the generalization ability of DPN is validated on the existing high-resolution MLLM LLaVA-HR. The source code will be released at https://github.com/aihao2000/DPN-LLaVA.
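To make the pyramid idea concrete, the sketch below counts how many visual tokens each layer processes when pooling is applied at selected depths. All numbers are illustrative assumptions rather than the paper's configuration: 32 layers, 576 initial visual tokens, and a pooling rate of 4 before layers 8 and 16. The cost proxy simply sums visual tokens over layers; it ignores text tokens and the quadratic term of attention, so it is not comparable to the FLOPs figures reported above.

```python
def tokens_per_layer(num_layers, initial_tokens, pool_at):
    """Return the visual token count seen by each layer.

    `pool_at` maps a layer index to the pooling rate applied just before
    that layer (hypothetical placement, chosen for illustration).
    """
    counts = []
    tokens = initial_tokens
    for layer in range(num_layers):
        rate = pool_at.get(layer, 1)
        tokens = -(-tokens // rate)  # ceiling division
        counts.append(tokens)
    return counts


def relative_cost(counts, initial_tokens):
    """Token-layer products relative to a model with no compression."""
    return sum(counts) / (initial_tokens * len(counts))


if __name__ == "__main__":
    counts = tokens_per_layer(num_layers=32, initial_tokens=576, pool_at={8: 4, 16: 4})
    print(counts[0], counts[8], counts[16])  # 576 144 36
    print(relative_cost(counts, 576))        # 0.34375
```

In this toy setting the first eight layers still see all 576 tokens, which mirrors the claim that fine-grained visual information remains available in shallow layers even when the overall compression ratio is high.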
