GOAL: Global-local Object Alignment Learning

Hyungyu Choi, Young Kyun Jang, Chanho Eom

TL;DR

GOAL addresses CLIP's difficulty with lengthy text descriptions through a global-local alignment framework with two components: Local Image-Sentence Matching (LISM) and Token Similarity-based Learning (TSL). LISM generates pseudo local image-text pairs by segmenting images with SAM and splitting captions into sentences, then matching the resulting segments and sentences via CLIP embeddings. TSL propagates attention on local elements through both image and text by jointly learning local and global representations with a multi-term training objective, which improves fine-grained cross-modal alignment. Evaluations on DOCCI, DCI, and Urban1k (from Long-CLIP) show significant improvements over baseline CLIP fine-tuning and Long-CLIP while preserving global understanding in zero-shot and short-caption settings, making the method practical for retrieval between images and lengthy text.

Abstract

Vision-language models like CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions because of their training focus on short and concise captions. We present GOAL (Global-local Object Alignment Learning), a novel fine-tuning method that enhances CLIP's ability to handle lengthy text by leveraging both global and local semantic alignments between image and lengthy text. Our approach consists of two key components: Local Image-Sentence Matching (LISM), which identifies corresponding pairs between image segments and descriptive sentences, and Token Similarity-based Learning (TSL), which efficiently propagates local element attention through these matched pairs. Evaluating GOAL on three new benchmarks for image-lengthy text retrieval, we demonstrate significant improvements over baseline CLIP fine-tuning, establishing a simple yet effective approach for adapting CLIP to detailed textual descriptions. Through extensive experiments, we show that our method's focus on local semantic alignment alongside global context leads to more nuanced and representative embeddings, particularly beneficial for tasks requiring fine-grained understanding of lengthy text descriptions.

Paper Structure

This paper contains 22 sections, 14 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: Comparison of CLIP and our GOAL's capability in handling image-text alignment. (a) CLIP is limited to global image-text matching, treating the entire image and full caption as single units without detailed associations. (b) GOAL can establish precise local alignments between specific regions in the image and their corresponding textual descriptions in the caption (highlighted in purple).
  • Figure 2: Overview of Local Image-Sentence Matching (LISM) pipeline. Given a global image and its detailed caption, LISM uses SAM to segment the image into local regions and splits the caption into individual sentences. These local pairs are then processed through CLIP encoders to obtain CLS embeddings, which are used for maximum similarity matching to identify the most relevant image-sentence pairs.
  • Figure 3: Overview of Token Similarity-based Learning (TSL). The framework processes global image-text pairs and their local pairs through shared CLIP encoders, extracting patch and sequence tokens. TSL identifies and projects corresponding token regions to match local CLS embeddings, enabling attention on local elements.
  • Figure 4: Comparison of attention maps generated by GOAL and w/o TSL methods. For each row pair, we present three components: (1) original input image (left), (2) attention heatmap visualization (middle), and (3) overlay of attention on the original image (right). The examples demonstrate how GOAL achieves more focused attention compared to the baseline w/o TSL method. Red circles in the overlay highlight regions where GOAL shows particularly effective attention localization.
  • Figure 5: Qualitative comparison of image-text retrieval results between GOAL (middle column) and Long-CLIP (right column). The retrieved descriptions demonstrate GOAL's superior ability to capture fine-grained details and diverse scene elements across indoor and outdoor environments, while maintaining semantic coherence in lengthy descriptions. Query images are shown in the left column.
  • ...and 1 more figure
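
The maximum similarity matching step described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`split_sentences`, `match_segments_to_sentences`), the naive punctuation-based sentence splitter, the use of cosine similarity, and the one-directional "best sentence per segment" rule are all assumptions made for illustration. The toy vectors stand in for the CLS embeddings that the CLIP encoders would produce for SAM segments and caption sentences.

```python
import math
import re


def split_sentences(caption):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p for p in parts if p]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_segments_to_sentences(segment_embs, sentence_embs):
    """For each segment embedding, pick the sentence with the highest similarity.

    Returns a list of (segment_index, sentence_index, similarity) tuples.
    """
    if not sentence_embs:
        return []
    pairs = []
    for i, seg in enumerate(segment_embs):
        sims = [cosine(seg, sent) for sent in sentence_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        pairs.append((i, best, sims[best]))
    return pairs


# Toy stand-ins for CLS embeddings (hypothetical values, not real CLIP outputs).
sentences = split_sentences(
    "A red bus is parked by the curb. Two people wait at the stop."
)
segment_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.2]]
sentence_embs = [[0.1, 0.9, 0.1], [0.9, 0.1, 0.0]]

for seg_idx, sent_idx, sim in match_segments_to_sentences(segment_embs, sentence_embs):
    print(f"segment {seg_idx} -> {sentences[sent_idx]!r} (similarity {sim:.3f})")
```

In this sketch, several segments may select the same sentence, and a sentence may go unmatched. Whether the pseudo pairs should instead be filtered by a similarity threshold or matched one-to-one is a design choice that the text here does not settle.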