
HiMix: Reducing Computational Complexity in Large Vision-Language Models

Xuange Zhang, Dengjie Li, Bo Liu, Zenghao Bao, Yao Zhou, Baisong Yang, Zhongying Liu, Yujie Zhong, Zheng Zhao, Tongtong Yuan

TL;DR

HiMix combines Hierarchical Vision Injection with Mixture Attention to cut the computational load of large vision-language models (LVLMs). Only the language sequence propagates through the full forward pass of the decoder; vision information is injected at selected stages, replacing full-sequence concatenation with targeted, stage-wise cross-modal interaction. This reduces language-decoder FLOPs by about $10\times$ with minimal performance loss. Hierarchical Vision Injection uses dedicated vision projection layers, and Mixture Attention lets the language queries attend over merged keys and values, $K_{vl}=[K_v;K_l]$ and $V_{vl}=[V_v;V_l]$. With $N$ vision tokens, $M$ language tokens, and hidden size $d$, the language-decoder pathway costs $O((N+M)Md) + O(8Md^2)$, compared with $O((N+M)^2 d) + O(8(N+M)d^2)$ for the vanilla design. Experiments on TinyLlama, Qwen2, and Llama3 variants, with SigLIP as the vision encoder and two training paradigms (Regular and Enhanced), show that HiMix stays competitive on most benchmarks (e.g., VQAv2, GQA, TextVQA, MM-Vet, POPE, MME, MMMU) at this lower cost. Additional evaluations with smaller vision encoders point to deployment on resource-constrained devices, and the results suggest the approach can scale to larger LLMs with data augmentation.
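To make the two complexity expressions concrete, the sketch below evaluates them with the constants taken at face value. It assumes $N$ is the vision sequence length and $M$ the language sequence length, which matches the merged-KV shapes (language queries of length $M$ attending over $N+M$ keys). The function names and the sample values of `N`, `M`, and `d` are hypothetical and are not taken from the paper. Both expressions share the factor $d\,[(N+M)+8d]$, so their ratio simplifies to $(N+M)/M$, independent of $d$.

```python
def vanilla_decoder_cost(n_vision: int, m_language: int, d: int) -> int:
    """Per-layer cost from O((N+M)^2 d) + O(8(N+M) d^2), constants taken literally."""
    seq = n_vision + m_language
    return seq * seq * d + 8 * seq * d * d


def himix_decoder_cost(n_vision: int, m_language: int, d: int) -> int:
    """Per-layer cost from O((N+M) M d) + O(8 M d^2), constants taken literally."""
    seq = n_vision + m_language
    return seq * m_language * d + 8 * m_language * d * d


# Hypothetical sizes chosen only for illustration.
N, M, d = 729, 64, 2048

ratio = vanilla_decoder_cost(N, M, d) / himix_decoder_cost(N, M, d)
closed_form = (N + M) / M  # the common factor d * ((N + M) + 8 * d) cancels

print(f"estimated reduction: {ratio:.2f}x (closed form: {closed_form:.2f}x)")
```

Under these expressions, a reduction of roughly $10\times$ corresponds to a vision sequence about nine times longer than the language sequence. This is an asymptotic estimate only; measured FLOPs also depend on terms the big-O expressions leave out.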

Abstract

Benefiting from recent advancements in large language models and modality alignment techniques, existing Large Vision-Language Models (LVLMs) have achieved prominent performance across a wide range of scenarios. However, their excessive computational complexity limits the widespread use of these models in practical applications. We argue that one main bottleneck in computational complexity is the involvement of redundant vision sequences in model computation. This view is motivated by a reassessment of how efficiently vision and language information is transmitted in the language decoder of LVLMs. We therefore propose a novel hierarchical vision-language interaction mechanism called Hierarchical Vision injection for Mixture Attention (HiMix). In HiMix, only the language sequence undergoes full forward propagation, while the vision sequence interacts with the language sequence at specific stages within each language decoder layer. Our approach significantly reduces computational complexity with minimal performance loss. Specifically, HiMix achieves a 10x reduction in the computational cost of the language decoder across multiple LVLM models while maintaining comparable performance. This highlights the advantages of our method, and we hope our research brings new perspectives to the field of vision-language understanding. Project Page: https://xuange923.github.io/HiMix
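The interaction described above can be illustrated with a minimal single-head NumPy sketch. This is not the authors' implementation. It assumes that language queries attend over vision keys and values concatenated with language keys and values, that the vision sequence has its own K/V projection matrices, and that language tokens are causally masked while every vision token stays visible. All function and variable names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability; masked (-inf) entries become 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mixture_attention(h_l, h_v, w_q, w_k_l, w_v_l, w_k_v, w_v_v):
    """Language queries (M, d) attend over merged vision+language keys/values (N+M, d)."""
    m, d = h_l.shape
    n = h_v.shape[0]
    q = h_l @ w_q                                              # queries: language only
    k_vl = np.concatenate([h_v @ w_k_v, h_l @ w_k_l], axis=0)  # [K_v; K_l]
    v_vl = np.concatenate([h_v @ w_v_v, h_l @ w_v_l], axis=0)  # [V_v; V_l]
    scores = q @ k_vl.T / np.sqrt(d)                           # shape (M, N+M)
    # Assumed mask: all vision keys visible, language keys causal.
    causal = np.tril(np.ones((m, m), dtype=bool))
    mask = np.concatenate([np.ones((m, n), dtype=bool), causal], axis=1)
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v_vl                              # shape (M, d)


rng = np.random.default_rng(0)
N, M, d = 16, 4, 8  # hypothetical sizes
h_v = rng.standard_normal((N, d))
h_l = rng.standard_normal((M, d))
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(5)]

out = mixture_attention(h_l, h_v, *weights)
print(out.shape)  # (4, 8)
```

The output has one row per language token, so only $M$ rows continue to the rest of the layer. The vision sequence contributes keys and values but is never itself a query, which is where the attention term $O((N+M)Md)$ comes from.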

Paper Structure (21 sections, 7 equations, 12 figures, 10 tables, 1 algorithm)


Figures (12)

  • Figure 1: Comparison of performance and computational cost of the language decoder between the original and HiMix models. The circles, arranged from smallest to largest, represent the models Qwen2-0.5B, Llama3-1B, TinyLlama-1.1B, and Llama3-3B. While maintaining a similar performance to the original models, our HiMix achieves a $10\times$ reduction in computational cost.
  • Figure 2: Layer-wise Cosine Similarity of Vision and Language Sequences.
  • Figure 3: Comparison of Vanilla Model and HiMix Architectures. Left: Overall structure of traditional Vanilla. Middle: Overall structure of HiMix. Right: Details of HiMix.
  • Figure 4: Exploration of Model Architecture Design. (a) Uniform Vision Injection to Each Layer. (b) Hierarchical Vision Injection Through Multi-Level Connectors. The Mixture Attention (MA) differs from the main method in that the vision and language sequences share the KV projection layers.
  • Figure 5: Performance gaps between HiMix and the baseline model across various benchmarks under two training strategies: Regular Paradigm and Enhanced Paradigm. The y-axis represents the performance gap relative to the baseline, with positive values indicating improvements. Subfigures (a), (b), and (c) show results for Qwen2-0.5B, Llama3-1B, and TinyLlama-1.1B, respectively. Results indicate that using the Enhanced Paradigm amplifies HiMix's performance benefits over the baseline.
  • ...and 7 more figures