IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning

Soeun Lee; Si-Woo Kim; Taewhan Kim; Dong-Jin Kim

TL;DR

IFCap proposes Image-like Retrieval, which aligns text features with visually relevant features to mitigate the modality gap, and a Fusion Module that integrates retrieved captions with input features to improve the accuracy of generated captions.

Abstract

Recent advancements in image captioning have explored text-only training methods to overcome the limitations of paired image-text data. However, existing text-only training methods often overlook the modality gap between using text data during training and employing images during inference. To address this issue, we propose a novel approach called Image-like Retrieval, which aligns text features with visually relevant features to mitigate the modality gap. Our method further enhances the accuracy of generated captions by designing a Fusion Module that integrates retrieved captions with input features. Additionally, we introduce a Frequency-based Entity Filtering technique that significantly improves caption quality. We integrate these methods into a unified framework, which we refer to as IFCap ($\textbf{I}$mage-like Retrieval and $\textbf{F}$requency-based Entity Filtering for Zero-shot $\textbf{Cap}$tioning). Through extensive experimentation, our straightforward yet powerful approach has demonstrated its efficacy, outperforming state-of-the-art zero-shot captioning methods based on text-only training by a significant margin in both image captioning and video captioning.
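
The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that Image-like Retrieval perturbs the query text feature with Gaussian noise of standard deviation $\sigma_r$ (the hyper-parameter named in Figure 5) before a cosine-similarity search; that noise-injection form is an assumption here, not something this page states. The function name `image_like_retrieval`, the random vectors standing in for CLIP features, and the values of `k` and `sigma_r` are all hypothetical.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def image_like_retrieval(query, corpus, k, sigma_r, rng):
    """Return indices of the k corpus vectors most similar to a noised query.

    Assumption: the text query is made "image-like" by adding Gaussian
    noise with standard deviation sigma_r before the similarity search.
    """
    noisy = normalize([x + rng.gauss(0.0, sigma_r) for x in query])
    scores = [
        sum(a * b for a, b in zip(noisy, normalize(c)))  # cosine similarity
        for c in corpus
    ]
    ranked = sorted(range(len(corpus)), key=lambda i: -scores[i])
    return ranked[:k]


rng = random.Random(0)
# Random unit vectors stand in for CLIP text features (hypothetical data).
corpus = [normalize([rng.gauss(0.0, 1.0) for _ in range(16)]) for _ in range(50)]
query = corpus[0]

clean_top = image_like_retrieval(query, corpus, k=5, sigma_r=0.0, rng=rng)
noisy_top = image_like_retrieval(query, corpus, k=5, sigma_r=0.2, rng=rng)
```

With `sigma_r=0.0` the search reduces to plain text-to-text retrieval, so the query's own entry ranks first; a nonzero `sigma_r` shifts the query and can change which captions are retrieved.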

Paper Structure (22 sections, 4 equations, 6 figures, 14 tables)

Introduction
Related work
Text-only Captioning
Modality Gap
Retrieval Augmented Generation
Methods
Image-like Retrieval (ILR)
Fusion Module (FM)
Frequency-based Entity Filtering (EF)
Experiments
Implementation Details
Text-only Captioning
In-domain Captioning
Cross-domain Captioning
Video Captioning
...and 7 more sections

Figures (6)

Figure 1: (Top) The previous text-to-text retrieval approach overlooks the modality gap, leading to different information use between training and inference. Our approach addresses this by aligning text features with the image embedding space during retrieval. (Bottom) The traditional CLIP classifier-based entity retrieval method struggles with entity detection as vocabulary size grows. Our approach detects frequently occurring words in retrieved captions, extracting entities more accurately without relying on a limited vocabulary.
Figure 2: The distribution of CLIP embedding features corresponding to images $\textcolor{violet}{\blacksquare}$, paired captions $\textcolor{red}{\medbullet}$, retrieved captions $\textcolor{yellow}{\medbullet}$ for a specific image, and the result of text-to-text retrieval $\textcolor{gray}{\medbullet}$ and our Image-like Retrieval $\textcolor{ao(english)}{\medbullet}$.
Figure 3: Precision of extracted entities on the COCO test set (5,000 images in total). An extracted entity counts as correct if it appears in the ground-truth caption and as wrong otherwise. Three methods (Ours, ViECap, DETR) are compared under three different settings. Our method is described in Section 3.3, and ViECap uses a CLIP-based classifier with the source domain's vocabulary list. We follow the way SynTIC uses DETR and employ the COCO vocabulary list. Because the vocabulary list of Flickr30k is inaccessible, DETR cannot be compared in that setting, and ViECap uses the VGOI vocabulary list for Flickr30k. Our method achieves the highest precision and the largest number of extracted entities in every setting.
Figure 4: The overview of IFCap. During training, we extract nouns from the input text and retrieve $k$ similar sentences using our Image-like Retrieval method. Extracted nouns are incorporated into a prompt template to form a hard prompt. Both the input text and retrieved sentences are encoded using the text encoder. These embeddings interact and combine through our Fusion Module before being fed into the LLM for sentence generation. During inference, we retrieve $l$ sentences similar to the input image and construct a hard prompt by extracting entities via Frequency-based Entity Filtering from the retrieved sentences. The sentences are encoded using a text encoder, and the input image is encoded using an image encoder, followed by input into the Fusion Module. The subsequent process follows a procedure similar to the training phase.
Figure 5: Hyper-parameter search for the best $\sigma_r$ used in Image-like Retrieval. All experiments are conducted on the COCO test set. The X-axis denotes $\sigma_r^2$, and the Y-axis denotes the scores of the commonly used captioning metrics BLEU@4 (B@4), METEOR (M), CIDEr (C), and SPICE (S).
...and 1 more figures
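
The captions for Figures 1, 3, and 4 describe entity extraction by word frequency in retrieved captions and a precision check against ground-truth captions. A minimal sketch of both ideas follows. It assumes that nouns have already been extracted from each retrieved caption, that an entity is kept when it appears in at least `threshold` captions, and that the hard prompt uses a simple template; the function names, the `threshold` parameter, the template wording, and the example data are all hypothetical.

```python
from collections import Counter


def filter_entities(noun_lists, threshold):
    """Keep nouns that appear in at least `threshold` retrieved captions."""
    counts = Counter()
    for nouns in noun_lists:
        counts.update(set(nouns))  # count each noun once per caption
    kept = [(noun, c) for noun, c in counts.items() if c >= threshold]
    kept.sort(key=lambda item: (-item[1], item[0]))  # most frequent first
    return [noun for noun, _ in kept]


def entity_precision(entities, ground_truth_captions):
    """Fraction of entities that occur in any ground-truth caption."""
    if not entities:
        return 0.0
    gt_words = {w for cap in ground_truth_captions for w in cap.lower().split()}
    correct = sum(1 for e in entities if e in gt_words)
    return correct / len(entities)


# Hypothetical nouns taken from four retrieved captions.
retrieved_nouns = [
    ["dog", "frisbee", "park"],
    ["dog", "grass"],
    ["dog", "frisbee"],
    ["man", "dog", "park"],
]
entities = filter_entities(retrieved_nouns, threshold=2)

# Hypothetical hard-prompt template.
prompt = "There are " + ", ".join(entities) + " in the image."

precision = entity_precision(entities, ["a dog catches a frisbee on the grass"])
```

Because the entities come from the retrieved captions themselves, this approach needs no fixed vocabulary list, which is the contrast with classifier-based extraction drawn in Figure 1.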
