Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models

Xu Li, Yi Zheng, Haotian Chen, Xiaolei Chen, Yuxuan Liang, Chenghang Lai, Bin Li, Xiangyang Xue

TL;DR

The paper addresses the tendency of LVLMs to rely predominantly on final-layer vision features by systematically evaluating hierarchical visual features across 18 benchmarks. It introduces the instruction-guided vision aggregator (IGVA), a lightweight module that dynamically weights multi-layer visual features based on textual instructions without increasing the number of visual tokens, and integrates it into the LLaVA-v1.5 framework. Through experiments and ablations, the authors show that mid-to-high-level features are crucial for semantic tasks while low-level features support fine-grained perception, and that task-aware dynamic fusion outperforms static fusion and existing multi-layer methods. The proposed approach achieves state-of-the-art or competitive results across a broad set of VQA and LVLM-specific benchmarks with better data efficiency, which suggests that instruction-guided feature fusion is a practical route to adaptable, multi-task vision-language reasoning.

Abstract

Large Vision-Language Models (LVLMs) have achieved remarkable success in a wide range of multimodal tasks by integrating pre-trained vision encoders and large language models. However, current LVLMs primarily rely on visual features extracted from the final layers of the vision encoder, overlooking the complementary information available in shallower layers. While recent approaches have explored the use of multilayer visual features in LVLMs, they tend to be task-agnostic and fail to examine the dependencies of hierarchical visual features on specific tasks. To address these gaps, we systematically investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance. Building on these insights, we propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions, without increasing the number of visual tokens. Extensive evaluations demonstrate the superior performance of our method. Additionally, an in-depth analysis of the aggregator's behavior highlights the dominance of mid-to-high-level features in semantic-rich tasks and the critical role of low-level features in fine-grained perception.
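The core idea in the abstract, weighting multi-layer visual features by the instruction while keeping the token count fixed, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the single linear projection used to score the layers, and the softmax normalization are all assumptions made for illustration. The sketch only shows that a weighted sum over K layer-feature groups yields the same N visual tokens as any single layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def allocate_weights(instruction_embedding, projection):
    """Hypothetical weight allocator: one score per layer group.

    projection has K rows (one per layer group), each the same length as
    the pooled instruction embedding. A linear map followed by softmax is
    an assumption, not the paper's architecture.
    """
    scores = [
        sum(w * x for w, x in zip(row, instruction_embedding))
        for row in projection
    ]
    return softmax(scores)


def fuse_layers(layer_features, weights):
    """Weighted sum of K feature maps, each shaped [N tokens][D dims].

    The output is also [N][D], so the number of visual tokens is unchanged.
    """
    num_tokens = len(layer_features[0])
    dim = len(layer_features[0][0])
    fused = [[0.0] * dim for _ in range(num_tokens)]
    for weight, feats in zip(weights, layer_features):
        for t in range(num_tokens):
            for d in range(dim):
                fused[t][d] += weight * feats[t][d]
    return fused


# Toy example: K=2 layer groups, N=2 tokens, D=2 dims (made-up values).
layers = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 4.0], [5.0, 6.0]],
]
weights = allocate_weights([1.0, 0.0], [[0.0, 0.0], [0.0, 0.0]])
fused = fuse_layers(layers, weights)
```

With an all-zero projection the allocator reduces to uniform weights, which corresponds to the static, task-agnostic fusion that the abstract reports as suboptimal; a learned projection would let the weights vary with the instruction.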

Paper Structure (20 sections, 8 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Attention distribution of LVLMs across image patches when generating the answer token. Each column corresponds to a model trained with visual features extracted from a different layer of the vision encoder. Regions relevant to the text query are highlighted with red bounding boxes.
  • Figure 2: Performance comparison of our method against the baseline model (LLaVA-v1.5-7B) and existing hierarchical visual feature fusion methods (DenseConnector and MMFuser) across 10 mainstream benchmarks.
  • Figure 3: Performance comparison of LVLMs trained using different single-layer visual features across 6 task categories. The vertical axis represents the normalized performance score, where each benchmark is scaled by setting the highest score to 1 and adjusting other models' scores accordingly.
  • Figure 4: (a) Overview of the proposed framework. (b) Detailed architecture of the weight allocator within the instruction-guided vision aggregator.
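The per-benchmark normalization described in the Figure 3 caption can be sketched as follows. The caption only says the highest score is set to 1 and the others are adjusted accordingly; dividing every score by the benchmark's best score is one plausible reading and is an assumption here, as are the function name and the example numbers.

```python
def normalize_by_best(scores):
    """Scale one benchmark's scores so the best model gets 1.0.

    scores maps a model name to its raw score on a single benchmark.
    Dividing by the maximum is an assumed interpretation of the caption.
    """
    best = max(scores.values())
    if best <= 0:
        raise ValueError("the best score must be positive to normalize")
    return {model: score / best for model, score in scores.items()}


# Made-up raw scores for models trained on different single layers.
raw = {"layer_a": 50.0, "layer_b": 25.0, "layer_c": 40.0}
normalized = normalize_by_best(raw)
```

Normalizing each benchmark separately puts benchmarks with different score ranges on a common scale before they are compared within a task category.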