Remodeling Semantic Relationships in Vision-Language Fine-Tuning
Xiangyang Wu, Liu Liu, Baosheng Yu, Jiayan Qiu, Zhenwei Shi
TL;DR
This work addresses the difficulty of modeling inter-semantic relationships in vision-language fine-tuning. It introduces the Learnable Semantic Relationship Method (LSRM), comprising three components: multilevel information fusion to capture both intermediate and final visual semantics, a semantic relationship projector with a learnable diagonal emphasis matrix, and an inheritable cross-attention mechanism that suppresses low-correlation cross-modal pairs across layers. Empirical results across eight foundation models show state-of-the-art performance on ScienceQA and competitive results on COCO Caption under a parameter-efficient fine-tuning regime, with notable improvements in accuracy and caption quality. The approach enables more robust cross-modal alignment and semantic reasoning, offering practical gains for multimodal tasks like visual question answering and image captioning while maintaining training and inference efficiency.
Abstract
Vision-language fine-tuning has emerged as an efficient paradigm for constructing multimodal foundation models. While textual context often highlights semantic relationships within an image, existing fine-tuning methods typically overlook this information when aligning vision and language, leading to suboptimal performance. To address this problem, we propose a method that improves multimodal alignment and fusion based on both semantics and relationships. Specifically, we first extract multilevel semantic features from different layers of the vision encoder to capture more visual cues of the relationships. Then, we learn to project the vision features so that related semantics, which are more likely to have relationships, are grouped together. Finally, we fuse the visual features with the textual features using inheritable cross-attention, where we globally remove redundant visual relationships by discarding vision-language feature pairs with low correlation. We evaluate our proposed method on eight foundation models and two downstream tasks, visual question answering and image captioning, and show that it outperforms all existing methods.
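
To make the final fusion step concrete, the following is a minimal sketch of a cross-attention layer that discards low-correlation text-vision pairs and passes the resulting mask on to later layers. It is an illustration under stated assumptions, not the paper's implementation: the function name `pruned_cross_attention`, the fixed score `threshold`, the residual update, and the use of raw scaled dot products as the correlation measure are all hypothetical choices made for this example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pruned_cross_attention(text, vision, keep, threshold):
    """One cross-attention layer with an inherited keep-mask (hypothetical sketch).

    text:      list of text feature vectors (queries)
    vision:    list of vision feature vectors (keys and values)
    keep:      keep[i][j] is True if pair (text i, vision j) survived earlier layers
    threshold: assumed cutoff; pairs scoring below it are discarded from now on
    Returns the fused text features and the mask to hand to the next layer.
    """
    scale = math.sqrt(len(text[0]))
    fused, new_keep = [], []
    for t_vec, keep_row in zip(text, keep):
        scores = [dot(t_vec, v_vec) / scale for v_vec in vision]
        # A pair survives only if it was kept so far AND is still correlated enough.
        row = [k and s >= threshold for k, s in zip(keep_row, scores)]
        kept = [j for j, k in enumerate(row) if k]
        out = list(t_vec)  # residual connection (an assumption of this sketch)
        if kept:
            top = max(scores[j] for j in kept)  # subtract max for numerical stability
            exps = {j: math.exp(scores[j] - top) for j in kept}
            total = sum(exps.values())
            for j in kept:
                w = exps[j] / total  # softmax restricted to surviving pairs
                out = [o + w * v for o, v in zip(out, vision[j])]
        fused.append(out)
        new_keep.append(row)
    return fused, new_keep


# Toy usage: two text tokens, three vision tokens, two stacked layers.
text = [[1.0, 0.0], [0.0, 1.0]]
vision = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]
keep = [[True] * len(vision) for _ in text]
for _ in range(2):
    text, keep = pruned_cross_attention(text, vision, keep, threshold=0.0)
```

In this toy run the third vision token is negatively correlated with both text tokens, so it is dropped in the first layer and, because the mask is inherited, it stays excluded in the second layer regardless of how the text features change.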
