GiVE: Guiding Visual Encoder to Perceive Overlooked Information
Junjie Li, Jianghong Ma, Xiaofeng Zhang, Yuhang Li, Jianyang Shi
TL;DR
GiVE addresses a gap in multimodal LLMs: their visual encoders either lack semantic alignment with text or overlook non-salient objects. It proposes the Attention-Guided Adapter (AG-Adapter) and Object-focused Visual Semantic Learning, trained with three losses (OITC, OIIC, and OID), and introduces the MOInst dataset to enable instruction-driven, object-centric perception. The approach yields state-of-the-art performance on image classification and image-text retrieval across LVIS and MOInst, with ablations confirming the necessity of all components. By enabling dynamic focus and comprehensive object retrieval, GiVE enhances semantic alignment and practical utility in vision-language systems.
Abstract
Multimodal Large Language Models have advanced AI in applications like text-to-video generation and visual question answering. These models rely on visual encoders to convert non-text data into vectors, but current encoders either lack semantic alignment or overlook non-salient objects. We propose the Guiding Visual Encoder to Perceive Overlooked Information (GiVE) approach. GiVE enhances visual representation with an Attention-Guided Adapter (AG-Adapter) module and an Object-focused Visual Semantic Learning module. These incorporate three novel loss terms: Object-focused Image-Text Contrast (OITC) loss, Object-focused Image-Image Contrast (OIIC) loss, and Object-focused Image Discrimination (OID) loss, improving object consideration, retrieval accuracy, and comprehensiveness. Our contributions include dynamic visual focus adjustment, novel loss functions to enhance object retrieval, and the Multi-Object Instruction (MOInst) dataset. Experiments show our approach achieves state-of-the-art performance.
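The abstract names the three loss terms without giving their formulas. For intuition only, the sketch below shows a generic symmetric image-text contrastive objective in the CLIP style, which is the family an "image-text contrast" loss usually belongs to. It is not the paper's OITC formulation: the function names, the temperature value, and the absence of any object or instruction conditioning are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Generic symmetric image-text contrastive loss (illustrative assumption).

    image_emb, text_emb: arrays of shape (batch, dim); row i of each is
    treated as a matched pair, all other rows as negatives.
    """
    # L2-normalise so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarities, sharpened by the temperature.
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])

    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: same, but normalising over columns.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    eye = np.eye(4)
    print("aligned pairs: ", symmetric_contrastive_loss(eye, eye))
    print("shuffled pairs:", symmetric_contrastive_loss(eye, np.roll(eye, 1, axis=0)))
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairing is shuffled. An object-focused variant would presumably condition the image embedding on the instructed object, but how GiVE does this is not stated in the abstract.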
