Activating Distributed Visual Region within LLMs for Efficient and Effective Vision-Language Training and Inference
Siyuan Wang, Dianyi Wang, Chengxing Zhou, Zejun Li, Zhihao Fan, Xuanjing Huang, Zhongyu Wei
TL;DR
The paper identifies a visual region within the LLM backbones of LVLMs and shows that updating only a sparse, uniformly distributed subset of layers (about 25%) during supervised fine-tuning preserves roughly 99% of visual performance while maintaining textual capabilities and reducing training time. It further introduces a visual region-based pruning paradigm that removes non-critical layers outside the region with minimal performance loss. Experiments on Bunny-Llama-3-8B-V, LLaVA-1.5-7B/13B, and Bunny-Phi3-mini-4B-V, covering visual perception and cognition tasks as well as textual evaluations, support both the training efficiency and the cross-model generalizability of the approach. The method is complementary to existing efficient-training techniques such as LoRA and offers a scalable framework for LVLM training and inference, with potential extensions to sparse architectures and broader modalities left to future work.
Abstract
Large Vision-Language Models (LVLMs) typically acquire visual capability through visual instruction tuning, which updates both a projector and the LLM backbone. Inspired by the concept of a visual region in the human brain, we investigate whether an analogous \textit{visual region} exists within LLMs and functions as a cognitive core, and we explore whether LVLMs can be trained efficiently by tuning only selected layers. Using Bunny-Llama-3-8B-V for detailed analysis and three other LVLMs for validation across diverse visual and textual tasks, we find that selectively updating 25\% of the LLM layers, when they are sparsely and uniformly distributed, preserves nearly 99\% of visual performance and maintains or improves textual task results, while reducing training time. Building on this targeted training approach, we further propose a novel visual region-based pruning paradigm that removes non-critical layers outside the visual region and incurs minimal performance loss. This study offers an effective and efficient strategy for LVLM training and inference by activating a layer-wise visual region within LLMs, and the strategy proves consistently effective across different models.
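The idea of updating a sparse, uniformly distributed 25\% of layers can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `uniform_sparse_layers` and `trainable_flags`, the even-stride selection rule, and the 32-layer stack used in the example are all assumptions made for illustration. The abstract does not state which layer indices the paper selects.

```python
def uniform_sparse_layers(num_layers: int, fraction: float = 0.25) -> list[int]:
    """Return about `fraction` of the layer indices, evenly spaced over the stack.

    Hypothetical helper: the even-stride rule is an assumption, not the
    selection procedure described in the paper.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Number of layers to update, at least one.
    count = max(1, round(num_layers * fraction))
    # Spread the chosen layers evenly from the first layer onward.
    stride = num_layers / count
    return [int(i * stride) for i in range(count)]


def trainable_flags(num_layers: int, selected: list[int]) -> list[bool]:
    """Mark each layer as trainable (True) or frozen (False)."""
    chosen = set(selected)
    return [index in chosen for index in range(num_layers)]


if __name__ == "__main__":
    # Assumed 32-layer decoder stack, used only as an example.
    layers = uniform_sparse_layers(32, fraction=0.25)
    flags = trainable_flags(32, layers)
    print("layers to update:", layers)
    print("trainable layers:", sum(flags), "of", len(flags))
```

In a training framework, flags like these would typically decide which layers receive gradient updates while the rest of the backbone stays frozen. How the projector is handled, and how layers are chosen for pruning, is not covered by this sketch.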
