Activating Distributed Visual Region within LLMs for Efficient and Effective Vision-Language Training and Inference

Siyuan Wang, Dianyi Wang, Chengxing Zhou, Zejun Li, Zhihao Fan, Xuanjing Huang, Zhongyu Wei

TL;DR

The paper identifies a distributed visual region within LLM backbones and demonstrates that selectively updating sparsely distributed layers (about 25% of layers) during supervised fine-tuning can preserve roughly 99% of visual performance while maintaining textual capabilities and reducing training time. It further introduces a visual region-based pruning paradigm that removes non-critical layers outside the region with minimal performance loss. The approach proves effective across multiple LVLMs and scales and is complementary to existing efficient-training methods like LoRA. Extensive experiments across Bunny-Llama-3-8B-V, LLaVA-1.5-7B/13B, and Bunny-Phi3-mini-4B-V validate both the training efficiency and cross-model generalizability, including assessments on visual perception and cognition tasks as well as textual evaluations. The approach provides a scalable framework for efficient LVLM training and inference, with potential extensions to sparse architectures and broader modalities in future work.

Abstract

Large Vision-Language Models (LVLMs) typically learn visual capacity through visual instruction tuning, involving updates to both a projector and their LLM backbones. Inspired by the concept of a visual region in the human brain, we investigate the existence of an analogous visual region within LLMs that functions as a cognitive core, and explore the potential of efficient training of LVLMs via selective layer tuning. Using Bunny-Llama-3-8B-V for detailed analysis and three other LVLMs for validation across diverse visual and textual tasks, we find that selectively updating 25% of LLM layers, when sparsely and uniformly distributed, can preserve nearly 99% of visual performance and maintain or improve textual task results, while effectively reducing training time. Based on this targeted training approach, we further propose a novel visual region-based pruning paradigm, removing non-critical layers outside the visual region, which can achieve minimal performance loss. This study offers an effective and efficient strategy for LVLM training and inference by activating a layer-wise visual region within LLMs, which proves consistently effective across different models.
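The selective layer tuning described in the abstract can be illustrated with a minimal sketch. It assumes that "sparsely and uniformly distributed" means evenly spaced layer indices; the exact indices used in the paper are not stated in this summary. The function names and the `layers.<index>.` parameter naming pattern are hypothetical choices for illustration, and the projector, which the abstract says is also updated, would need to be handled separately.

```python
import re

def select_uniform_layers(num_layers, fraction):
    """Return evenly spaced layer indices covering `fraction` of the layers.

    Assumption: "uniformly distributed" is read as equal spacing from layer 0.
    """
    count = max(1, round(num_layers * fraction))
    step = num_layers / count
    return [int(i * step) for i in range(count)]

def is_trainable(param_name, selected_layers):
    """Decide whether a parameter belongs to one of the selected layers.

    Assumption: parameter names contain 'layers.<index>.' (hypothetical
    naming). Parameters outside any layer are treated as frozen here.
    """
    match = re.search(r"\blayers\.(\d+)\.", param_name)
    if match is None:
        return False
    return int(match.group(1)) in selected_layers

if __name__ == "__main__":
    # A 32-layer backbone with 25% of layers selected gives 8 layers.
    selected = select_uniform_layers(32, 0.25)
    print(selected)
    print(is_trainable("model.layers.4.mlp.up_proj.weight", set(selected)))
```

In a training framework, the boolean returned by `is_trainable` would typically be used to set each parameter's gradient flag before fine-tuning, so that only the selected layers are updated.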

Paper Structure

This paper contains 25 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Left: Perplexity of LLaVA with selected layers (in parentheses) reverted to Vicuna parameters on visual and textual tasks. Arrows indicate perplexity increases relative to LLaVA (visual tasks) and Vicuna (textual tasks). (1) Perplexity increases in textual tasks after multimodal training compared to the LLM backbone, indicating that multimodal training compromises LLMs' linguistic abilities. (2) Perplexity decreases in visual tasks when reverting certain layers (e.g., reverting layers 16–23 or 24–31 in LLaVA), suggesting these layers are redundant. Right: Accuracy of LLaVA-1.5-7B when pruning certain layers based on angular distance scores (Gromov et al., 2024).
  • Figure 2: Performance variation of the re-trained Bunny-Llama-3-8B-V model across different training data scales during the supervised fine-tuning stage, when tuning varying numbers of layers. Dashed lines indicate 98% of the performance achieved by tuning all layers with the corresponding training data scale.
  • Figure 3: Computational costs for tuning LLaVA-1.5-7B, Bunny-Llama-3-8B-V, and LLaVA-1.5-13B with different numbers of layers using LoRA.
  • Figure 4: Results of pruning LLaVA-1.5-7B using an angular distance-based strategy with 0 to 4 layers removed. Dashed lines represent pruning applied to the fully trained model, while solid lines denote our visual region-based pruning within the targeted trained model.
  • Figure 5: Visualization of image attention scores for every instance across all layers.
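
The angular distance scores mentioned in the captions for Figures 1 and 4 can be sketched as follows. This is a minimal illustration of the general idea of angular distance-based layer pruning, under the assumption that the score between two hidden states is arccos of their cosine similarity divided by π, and that the block of consecutive layers whose input and output are closest is the pruning candidate. The function names and toy vectors are hypothetical; how the paper restricts candidates to layers outside the visual region is not specified in this summary and is not modeled here.

```python
import math

def angular_distance(u, v):
    """Angular distance in [0, 1]: arccos(cosine similarity) / pi."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp to guard against floating-point drift outside [-1, 1].
    cosine = max(-1.0, min(1.0, cosine))
    return math.acos(cosine) / math.pi

def most_redundant_block(hidden_states, block_size):
    """Return the start index of the block whose input and output are closest.

    hidden_states[i] is the (hypothetical) representation entering layer i;
    the final entry is the output of the last layer.
    """
    best_start, best_score = None, float("inf")
    for start in range(len(hidden_states) - block_size):
        score = angular_distance(hidden_states[start],
                                 hidden_states[start + block_size])
        if score < best_score:
            best_start, best_score = start, score
    return best_start

if __name__ == "__main__":
    # Toy 2-D hidden states for a 3-layer model (illustrative values only).
    toy_states = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
    print(most_redundant_block(toy_states, 1))
```

In practice the distance would be averaged over many input samples before choosing which layers to remove; the single-vector version above only shows the scoring step.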