GOAL: Global-local Object Alignment Learning
Hyungyu Choi, Young Kyun Jang, Chanho Eom
TL;DR
GOAL addresses the limitation of CLIP in handling lengthy text descriptions by introducing a global-local alignment framework built from Local Image-Sentence Matching (LISM) and Token Similarity-based Learning (TSL). LISM generates pseudo local image-text pairs by segmenting images with SAM and splitting captions into sentences, then matching these pieces via CLIP embeddings. TSL propagates local element attention through both image and text by learning coordinated local and global representations with a multi-term training objective, improving fine-grained cross-modal alignment. Evaluations on DOCCI, DCI, and Urban1k LongCLIP show significant improvements over baseline CLIP fine-tuning and Long-CLIP, while preserving global understanding in zero-shot and short-caption settings, indicating strong practical impact for image-lengthy text retrieval tasks.
Abstract
Vision-language models like CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions because of their training focus on short and concise captions. We present GOAL (Global-local Object Alignment Learning), a novel fine-tuning method that enhances CLIP's ability to handle lengthy text by leveraging both global and local semantic alignments between image and lengthy text. Our approach consists of two key components: Local Image-Sentence Matching (LISM), which identifies corresponding pairs between image segments and descriptive sentences, and Token Similarity-based Learning (TSL), which efficiently propagates local element attention through these matched pairs. Evaluating GOAL on three new benchmarks for image-lengthy text retrieval, we demonstrate significant improvements over baseline CLIP fine-tuning, establishing a simple yet effective approach for adapting CLIP to detailed textual descriptions. Through extensive experiments, we show that our method's focus on local semantic alignment alongside global context leads to more nuanced and representative embeddings, particularly beneficial for tasks requiring fine-grained understanding of lengthy text descriptions.
