Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training

Longtian Qiu; Shan Ning; Xuming He

TL;DR

This work addresses zero-shot image captioning by examining the CLIP embedding space and the modality gap between its image and text representations. It finds that features of image subregions often align more closely with the paired caption than the global image feature does, and that the modality gap can be modeled as a zero-mean Gaussian distribution, which motivates a region-aware, text-only training regime. The authors propose MacCap, which combines subregion feature aggregation with a learnable adaptor that maps CLIP features into an LLM's language space. MacCap is trained via text reconstruction with region noise injection and uses multiple sampling with CLIP reranking at inference time. Empirically, MacCap achieves strong zero-shot cross-domain and in-domain captioning performance and extends to zero-shot VQA, with both CLIP and the language model kept frozen.

Abstract

Image captioning aims to generate descriptive and meaningful textual descriptions of images, enabling a broad range of vision-language applications. Prior works have demonstrated that harnessing the power of Contrastive Language-Image Pre-training (CLIP) offers a promising approach to achieving zero-shot captioning, eliminating the need for expensive caption annotations. However, the widely observed modality gap in the latent space of CLIP harms the performance of zero-shot captioning by breaking the alignment between paired image-text features. To address this issue, we conduct an analysis of the CLIP latent space, which leads to two findings. First, we observe that CLIP's visual features of image subregions can achieve closer proximity to the paired caption due to the inherent information loss in text descriptions. Second, we show that the modality gap between paired image-text features can be empirically modeled as a zero-mean Gaussian distribution. Motivated by these findings, we propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap. In particular, we introduce a subregion feature aggregation to leverage local region information, which produces a compact visual representation for matching the text representation. Moreover, we incorporate a noise injection and CLIP reranking strategy to boost captioning performance. We also extend our framework to build a zero-shot VQA pipeline, demonstrating its generality. Through extensive experiments on common captioning and VQA datasets such as MSCOCO, Flickr30k and VQAv2, we show that our method achieves remarkable performance improvements. Code is available at https://github.com/Artanic30/MacCap.
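
The zero-mean Gaussian finding suggests a simple way to picture text-only training: perturb a text feature with Gaussian noise so that it resembles an image-side feature, then train the decoder to reconstruct the caption from it. The snippet below is only a minimal sketch of that idea, not the authors' region noise injection. The function names, the 512-dimensional random stand-in for a CLIP text feature, the value sigma=0.02, and the renormalization step are all assumptions made for illustration.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def inject_gaussian_noise(text_feature, sigma, rng):
    """Add i.i.d. zero-mean Gaussian noise to each dimension.

    Renormalizing afterwards is an assumption of this sketch, chosen so the
    noisy feature stays on the unit sphere like a normalized CLIP feature.
    """
    noisy = [x + rng.gauss(0.0, sigma) for x in text_feature]
    return l2_normalize(noisy)


rng = random.Random(0)
# Random stand-in for a normalized CLIP text feature (hypothetical data).
text_feature = l2_normalize([rng.gauss(0.0, 1.0) for _ in range(512)])
noisy_feature = inject_gaussian_noise(text_feature, sigma=0.02, rng=rng)

# How far the noisy feature drifted from the original text feature.
cosine = sum(a * b for a, b in zip(text_feature, noisy_feature))
print(f"cosine(text, noisy) = {cosine:.3f}")
```

A larger sigma pushes the noisy feature further from the original text feature, so the noise scale acts as a knob on how much modality gap the decoder is trained to tolerate.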

Paper Structure (34 sections, 5 equations, 6 figures, 6 tables)

Introduction
Related Work
Zero-shot Image Captioning
Vision-language Models
CLIP Embedding Space Analysis
Modality Gap of CLIP Representations
Distribution of Modality Gap
Methodology
Method Overview
Text Reconstruction Training
Region Noise Injection
Adaptor Decoder
Zero-shot Caption Generation
Sub-region Feature Aggregation
Multiple Sampling and Filtering
...and 19 more sections

Figures (6)

Figure 1: The upper half shows an example of the misalignment between a paired image and its text description. The lower half shows the distribution of the modality gap between the text representation and the global and local image representations, respectively.
Figure 2: An overview of the MacCap pipeline. In text reconstruction training, MacCap learns to generate text from a CLIP text feature with injected region noise. During inference, MacCap generates captions without having used paired data in training. CLIP and the language model are kept frozen in both stages.
Figure 3: Multiple sampling and filtering pipeline. During inference, each image uses noise to generate several different captions, which are reranked by CLIP to output the best one.
Figure 4: Performance of MacCap with different training noise. MacCap is trained on the CC3M dataset and tested on the Flickr30K dataset.
Figure 5: Performance of MacCap with different patch numbers in inference. MacCap is trained on the CC3M dataset and tested on the Flickr30K dataset. The length of the text region feature in text reconstruction training is set to 10.
...and 1 more figures
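
The reranking step described in the Figure 3 caption can be sketched as follows: score each candidate caption against the image and keep the highest-scoring one. This is a hedged illustration rather than the MacCap code. The helper names are hypothetical, the three-dimensional vectors are toy stand-ins for CLIP image and text features, and cosine similarity is assumed as the scoring function.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rerank(image_feature, candidates):
    """Return the (caption, score) pair whose text feature best matches the image.

    `candidates` is a list of (caption, text_feature) pairs, e.g. several
    captions sampled for the same image.
    """
    scored = [
        (caption, cosine_similarity(image_feature, feature))
        for caption, feature in candidates
    ]
    return max(scored, key=lambda pair: pair[1])


# Toy features (hypothetical); real ones would come from the CLIP encoders.
image_feature = [0.9, 0.1, 0.0]
candidates = [
    ("a dog on grass", [0.8, 0.2, 0.1]),
    ("a city skyline", [0.0, 0.1, 0.9]),
    ("a plate of food", [0.1, 0.9, 0.2]),
]
best_caption, best_score = rerank(image_feature, candidates)
print(best_caption)
```

Because the scoring only needs the frozen CLIP encoders, a filter of this kind adds no trainable parameters to the pipeline.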
