Remodeling Semantic Relationships in Vision-Language Fine-Tuning

Xiangyang Wu, Liu Liu, Baosheng Yu, Jiayan Qiu, Zhenwei Shi

TL;DR

This work addresses the difficulty of modeling inter-semantic relationships in vision-language fine-tuning. It introduces the Learnable Semantic Relationship Method (LSRM), comprising three components: multilevel information fusion to capture both intermediate and final visual semantics, a semantic relationship projector with a learnable diagonal emphasis matrix, and an inheritable cross-attention mechanism that suppresses low-correlation cross-modal pairs across layers. Empirical results across eight foundation models show state-of-the-art performance on ScienceQA and competitive results on COCO Caption under a parameter-efficient fine-tuning regime, with notable improvements in accuracy and caption quality. The approach enables more robust cross-modal alignment and semantic reasoning, offering practical gains for multimodal tasks like visual question answering and image captioning while maintaining training and inference efficiency.

Abstract

Vision-language fine-tuning has emerged as an efficient paradigm for constructing multimodal foundation models. While textual context often highlights semantic relationships within an image, existing fine-tuning methods typically overlook this information when aligning vision and language, leading to suboptimal performance. To address this problem, we propose a method that improves multimodal alignment and fusion based on both semantics and relationships. Specifically, we first extract multilevel semantic features from different layers of the vision encoder to capture more visual cues about the relationships. Then, we learn to project the vision features so that related semantics, which are more likely to have relationships with one another, are grouped together. Finally, we fuse the visual features with the textual features using inheritable cross-attention, where we globally remove redundant visual relationships by discarding vision-language feature pairs with low correlation. We evaluate our method on eight foundation models and two downstream tasks, visual question answering and image captioning, and show that it outperforms all existing methods.
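The three steps in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the function names, tensor shapes, the use of concatenation for fusion, the residual connection, and the per-text-token top-k rule with a hypothetical `keep_ratio` are all assumptions made for illustration. The only points taken from the text are that intermediate and final visual features are combined, that the projector contains a learnable diagonal emphasis matrix, and that low-correlation vision-language pairs are discarded and stay discarded in later layers.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_multilevel(mid_feats, final_feats):
    # Assumption: fusion is plain concatenation along the token axis.
    return np.concatenate([mid_feats, final_feats], axis=0)


def project_with_diagonal_emphasis(vis, w_down, lam, w_up):
    # lam holds the diagonal of the emphasis matrix, so multiplying by
    # diag(lam) reduces to scaling each hidden dimension.
    hidden = vis @ w_down
    return (hidden * lam) @ w_up


def inheritable_cross_attention(text, vis, inherited_mask, keep_ratio=0.5):
    # scores[i, j]: correlation between text token i and visual token j.
    scores = text @ vis.T / np.sqrt(text.shape[-1])
    # Pairs discarded by an earlier layer stay discarded.
    # Every row of inherited_mask must keep at least one True entry.
    scores = np.where(inherited_mask, scores, -np.inf)
    # Assumption: keep the top-k surviving pairs for each text token.
    k = max(1, int(keep_ratio * inherited_mask.sum(axis=1).min()))
    threshold = np.sort(scores, axis=1)[:, ::-1][:, k - 1]
    new_mask = inherited_mask & (scores >= threshold[:, None])
    attn = softmax(np.where(new_mask, scores, -np.inf), axis=1)
    # Assumption: residual connection around the attention output.
    return text + attn @ vis, new_mask


rng = np.random.default_rng(0)
vis = fuse_multilevel(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
vis = project_with_diagonal_emphasis(
    vis, rng.normal(size=(8, 6)), np.ones(6), rng.normal(size=(6, 8))
)
text = rng.normal(size=(3, 8))
mask = np.ones((3, 8), dtype=bool)
text, mask = inheritable_cross_attention(text, vis, mask)  # layer 1
text, mask = inheritable_cross_attention(text, vis, mask)  # layer 2 inherits
```

Passing the returned mask into the next call is what makes the pruning "inheritable" in this sketch: each layer can only narrow the set of pairs kept by the previous one.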

Paper Structure

This paper contains 27 sections, 9 equations, 5 figures, 11 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of our paper: (a) Existing VL fine-tuning methods have a weak ability to capture semantic relationships; (b) Our method extracts global information from the vision encoder and strengthens relationship modeling through better alignment.
  • Figure 2: The main LSRM (Learnable Semantic Relationship Method) framework. "$\oplus$" denotes matrix addition, "$\otimes$" denotes matrix multiplication, and "$\odot$" denotes the Hadamard product of matrices (i.e., element-wise multiplication). The example input and output are from the ScienceQA dataset (lu2022learn).
  • Figure 3: Visualization of Inheritable Cross-Attention. In each row, the left figure is the original image, while the middle and right figures show the values of the inheritable matrix between two representative text tokens and each image token.
  • Figure 4: Comparison of different hyperparameter settings in the Inheritable Cross-Attention with LLaMA-7B as the language model.
  • Figure 5: (a) Visualization of multilevel information fusion. In each row, the left figure is the original image, while the middle and right figures show the attention intensity (specifically, the maximum value of the projected output) for each image patch from the final-layer output and the intermediate-layer output of the visual encoder, respectively. (b) Visualization of the semantic relationship projector. In each row, the left figure is the original image, while the middle and right figures show the projection value of each image token across two hidden feature dimensions of SRProj. "Weight" denotes the corresponding value in $\Lambda$ for each dimension. "Feature No." denotes the index of each feature dimension.
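
The per-patch "attention intensity" described in the Figure 5(a) caption, the maximum value of the projected output for each image patch, could be computed along the following lines. This is a minimal sketch only: the function name `patch_intensity_map`, the assumption that the projected output has one row per patch with no extra class token, and the min-max normalization for display are all hypothetical and not taken from the paper.

```python
import numpy as np


def patch_intensity_map(projected, grid_h, grid_w):
    # projected: (num_patches, feature_dim) output of the projector.
    if projected.shape[0] != grid_h * grid_w:
        raise ValueError("number of patches does not match the grid size")
    # Intensity of a patch = maximum over its projected feature dimensions.
    intensity = projected.max(axis=1)
    # Assumption: rescale to [0, 1] so the map can be drawn as a heat map.
    lo, hi = intensity.min(), intensity.max()
    if hi > lo:
        intensity = (intensity - lo) / (hi - lo)
    else:
        intensity = np.zeros_like(intensity)
    return intensity.reshape(grid_h, grid_w)
```

The resulting grid can then be upsampled to the image resolution and overlaid on the original image with any plotting library.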