GiVE: Guiding Visual Encoder to Perceive Overlooked Information

Junjie Li, Jianghong Ma, Xiaofeng Zhang, Yuhang Li, Jianyang Shi

TL;DR

GiVE addresses two shortcomings of visual encoders in multimodal LLMs: they either lack semantic alignment with text or overlook non-salient objects. It proposes the Attention-Guided Adapter (AG-Adapter) and Object-focused Visual Semantic Learning, trained with three losses (OITC, OIIC, and OID), alongside the MOInst dataset, to enable instruction-driven, object-centric perception. The approach yields state-of-the-art performance on image classification and image-text retrieval on LVIS and MOInst, with ablations confirming the necessity of all components. By enabling dynamic focus and comprehensive object retrieval, GiVE enhances semantic alignment and practical utility in vision-language systems.

Abstract

Multimodal Large Language Models have advanced AI in applications like text-to-video generation and visual question answering. These models rely on visual encoders to convert non-text data into vectors, but current encoders either lack semantic alignment or overlook non-salient objects. We propose the Guiding Visual Encoder to Perceive Overlooked Information (GiVE) approach. GiVE enhances visual representation with an Attention-Guided Adapter (AG-Adapter) module and an Object-focused Visual Semantic Learning module. These incorporate three novel loss terms: Object-focused Image-Text Contrast (OITC) loss, Object-focused Image-Image Contrast (OIIC) loss, and Object-focused Image Discrimination (OID) loss, improving object consideration, retrieval accuracy, and comprehensiveness. Our contributions include dynamic visual focus adjustment, novel loss functions to enhance object retrieval, and the Multi-Object Instruction (MOInst) dataset. Experiments show our approach achieves state-of-the-art performance.

Paper Structure

This paper contains 17 sections, 13 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Motivation overview. (a) The original reconstruction-based encoder perceives the full image but is not aligned with textual semantics, thereby limiting its utility for LLMs in effectively interpreting the image embeddings. (b) The contrastive learning-based encoder only processes images without the benefit of textual instructions, leading to a focus solely on salient objects (Ⓟ) and neglecting user-specific concerns (Ⓞ). (c) Our proposed visual encoder addresses these limitations by flexibly adjusting its focus to highlight various objects, whether salient (Ⓢ) or non-salient (Ⓝ), according to the provided instructions.
  • Figure 2: Overall architecture of GiVE. The plug-in module, AG-Adapter, is inserted into the feature extraction layers of the visual encoder and trained with the three losses proposed in our work: Object-focused Image-Text Contrast (OITC), Object-focused Image-Image Contrast (OIIC), and Object-focused Image Discrimination (OID). Cross-attention is used to emphasize the visual elements most relevant to the textual instructions. The text instructions are pre-integrated into the prompt template designated as "a photo of {object}".
  • Figure 3: Learning objectives illustration. The image and text encoders jointly compute three losses. (a) For Object-focused Image-Text Contrast, the paired text and image should not only correspond to each other but also correspond to the same semantic object. (b) Object-focused Image-Image Contrast requires the model to predict pairs of image features that contain the same semantic object. (c) Object-focused Image Discrimination determines whether a specific object exists in the image or not. The text instructions, such as "bike", are pre-integrated into the prompt template designated as "a photo of {object}".
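
The captions for Figures 2 and 3 describe two mechanisms that a small example may make more concrete: cross-attention that lets image features attend to a text instruction, and a contrastive objective that pulls matching image and text embeddings together. The NumPy sketch below is illustrative only and is not the authors' implementation. The function names, the single-head attention without learned projections, the residual addition, the temperature value, and the use of a generic symmetric InfoNCE loss as a stand-in for an object-focused image-text contrast are all assumptions. Only the prompt template "a photo of {object}" is taken from the captions.

```python
import numpy as np

# Prompt template quoted in the Figure 2 and Figure 3 captions.
PROMPT_TEMPLATE = "a photo of {object}"


def build_prompt(obj: str) -> str:
    """Fill the prompt template with an object name (e.g. "bike")."""
    return PROMPT_TEMPLATE.format(object=obj)


def log_softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def cross_attention(patches: np.ndarray, text_tokens: np.ndarray) -> np.ndarray:
    """Image patches (queries) attend to instruction tokens (keys and values).

    Assumed simplification: one head, no learned projections, residual add.
    patches: (num_patches, dim); text_tokens: (num_tokens, dim).
    """
    dim = patches.shape[-1]
    scores = patches @ text_tokens.T / np.sqrt(dim)
    weights = np.exp(log_softmax(scores, axis=-1))  # each row sums to 1
    return patches + weights @ text_tokens


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def symmetric_contrastive_loss(
    img: np.ndarray, txt: np.ndarray, temperature: float = 0.07
) -> float:
    """Generic symmetric InfoNCE loss; row i of img is paired with row i of txt.

    A stand-in for an image-text contrast, not the paper's OITC formula.
    """
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    image_to_text = log_softmax(logits, axis=1)[idx, idx]
    text_to_image = log_softmax(logits, axis=0)[idx, idx]
    return float(-(image_to_text.mean() + text_to_image.mean()) / 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(4, 8))      # 4 hypothetical image patches
    instruction = rng.normal(size=(3, 8))  # 3 hypothetical text tokens
    print(build_prompt("bike"))
    print(cross_attention(patches, instruction).shape)

    embeddings = np.eye(3)
    print(symmetric_contrastive_loss(embeddings, embeddings))
    print(symmetric_contrastive_loss(embeddings, embeddings[[1, 2, 0]]))
```

In this toy setup, correctly paired embeddings give a loss close to zero, while shuffling the text rows so that no pair matches gives a much larger loss. An object-focused variant would presumably define the positive pairs per object rather than per image, but the captions do not give enough detail to reproduce that here.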