Image-Caption Encoding for Improving Zero-Shot Generalization

Eric Yang Yu; Christopher Liao; Sathvik Ravi; Theodoros Tsiligkaridis; Brian Kulis

TL;DR

This work tackles zero-shot generalization in vision-language models by introducing Image-Caption Encoding (ICE), a training-free mechanism that refines image-based predictions with caption-derived signals at inference time. ICE constructs a caption embedding from multiple prompts, computes image and caption probabilities over the Top-$K$ classes, and fuses them with an adaptive weight $\lambda$ to steer the final decision within the Top-$K$ set, formalized as the score $S^I_{\omega} + \lambda S^c_{\omega}$ for the selected class $\omega$, with $\lambda = \xi \frac{\sigma(S^c_K)}{\max(\bigl\|[\sigma(S^I_K), \sigma(S^c_K)]\bigr\|_2, \epsilon)}$, where $\sigma$ denotes the standard deviation of the image or caption probabilities over the Top-$K$ classes. ICE is shown to improve zero-shot accuracy by about 0.5% on average and up to 3% on several challenging datasets, while remaining compatible with multiple SOTA backbones (e.g., CoCa, BLIP-2, LLaVA). The method relies on caption properties that can provide complementary information not fully captured by image embeddings, and the paper includes ablations on the number of captions, Top-$K$, and adaptive weighting to understand when ICE is most beneficial. The results indicate practical value for deploying ICE to bolster OOD generalization in real-world, label-scarce settings and demonstrate that captions can be a meaningful diversification signal in multimodal inference.
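To make the fusion rule in the TL;DR concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the class probabilities are already computed, that the Top-$K$ candidates are ranked by the image probabilities, and that $\sigma$ is the population standard deviation. The function name `ice_predict` and the default values for `xi` and `eps` are hypothetical placeholders.

```python
import numpy as np

def ice_predict(image_probs, caption_probs, k=5, xi=0.08, eps=1e-8):
    """Fuse image and caption class probabilities over the Top-K classes.

    Returns the predicted class index and the adaptive weight lambda.
    xi and eps are illustrative defaults, not values taken from the paper.
    """
    image_probs = np.asarray(image_probs, dtype=float)
    caption_probs = np.asarray(caption_probs, dtype=float)

    # Assumption: the Top-K candidates are the K most probable classes
    # according to the image prediction.
    top_k = np.argsort(-image_probs)[:k]
    s_img = image_probs[top_k]      # S^I_K
    s_cap = caption_probs[top_k]    # S^c_K

    # Adaptive weight: lambda = xi * sigma(S^c_K) / max(||[sigma_I, sigma_c]||_2, eps)
    sigma_img = s_img.std()
    sigma_cap = s_cap.std()
    lam = xi * sigma_cap / max(np.hypot(sigma_img, sigma_cap), eps)

    # Final score S^I + lambda * S^c, maximized within the Top-K set only.
    scores = s_img + lam * s_cap
    return int(top_k[np.argmax(scores)]), float(lam)

# The image prediction narrowly prefers class 0, while the caption
# prediction is confident about class 1.
pred, lam = ice_predict([0.40, 0.38, 0.12, 0.06, 0.04],
                        [0.05, 0.80, 0.05, 0.05, 0.05])
print(pred, lam)  # picks class 1, whereas the image alone would pick class 0
```

Because the caption term is scaled by its own spread relative to the joint spread, a flat (uninformative) caption distribution drives $\lambda$ toward zero and the image prediction is left unchanged.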

Abstract

Recent advances in vision-language models have combined contrastive approaches with generative methods to achieve state-of-the-art (SOTA) results on downstream inference tasks such as zero-shot image classification. However, a persistent weakness of these models in image classification is their out-of-distribution (OOD) generalization. We first show that when an OOD data point is misclassified, the correct class can typically be found in the Top-K predicted classes. To steer the model prediction toward the correct class within the top predicted classes, we propose the Image-Caption Encoding (ICE) method, a straightforward approach that directly enforces consistency between the image-conditioned and caption-conditioned predictions at evaluation time only. Intuitively, we take advantage of unique properties of the generated captions to guide our local search for the correct class label within the Top-K predicted classes. We show that our method can be easily combined with other SOTA methods to enhance Top-1 OOD accuracies by 0.5% on average and up to 3% on challenging datasets. Our code: https://github.com/Chris210634/ice
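The observation that motivates ICE, that the correct class usually sits within the Top-K predictions even when the Top-1 prediction is wrong, can be measured with a few lines of NumPy. The sketch below is illustrative only: the function name `topk_accuracy_on_misclassified` and the toy inputs are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def topk_accuracy_on_misclassified(probs, labels, k=5):
    """Fraction of Top-1 errors whose true label is still in the Top-K.

    probs:  (N, C) array of class probabilities or logits.
    labels: (N,) array of integer ground-truth class indices.
    Returns NaN when no sample is misclassified.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)

    # Indices of the K highest-scoring classes per sample, best first.
    top_k = np.argsort(-probs, axis=1)[:, :k]

    # Restrict attention to samples whose Top-1 prediction is wrong.
    wrong = top_k[:, 0] != labels
    if not wrong.any():
        return float("nan")

    # A misclassified sample counts as a hit if its label is in the Top-K.
    hit = (top_k[wrong] == labels[wrong, None]).any(axis=1)
    return float(hit.mean())

toy_probs = [[0.5, 0.3, 0.1, 0.1],   # label 1: wrong at Top-1, found in Top-2
             [0.1, 0.2, 0.3, 0.4],   # label 0: wrong at Top-1, missed in Top-2
             [0.7, 0.1, 0.1, 0.1]]   # label 0: correct, excluded from the count
print(topk_accuracy_on_misclassified(toy_probs, [1, 0, 0], k=2))  # 0.5
```

A high value of this statistic indicates that a local search within the Top-K set, which is what ICE performs, has room to recover the correct label.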

Paper Structure (23 sections, 3 equations, 7 figures, 5 tables)

Introduction
Related Works
Methodology
Preliminaries
Image-Caption Encoding
Additional Modifications
Caption Properties
Experiments
Datasets.
Zero-Shot Classification
Baselines.
Results.
Few-Shot Classification
Baselines.
Results.
...and 8 more sections

Figures (7)

Figure 1: A demonstration of how our ICE method can be used to reclassify images correctly. In these examples, ICE is applied directly to a frozen pre-trained CLIP-based model for zero-shot classification. Using the context provided by the generated captions, ICE successfully steers the pre-trained model toward predicting the correct classes.
Figure 2: A visualization of the Top-5 accuracies on misclassified Top-1 datapoints in each test dataset. Recall that correct Top-5 classifications form a strict superset of the correct Top-1 classifications. We observe that across all datasets, the true correct class can be found within the Top-5 predicted classes for most misclassified datapoints.
Figure 3: A visualization of Top-1 accuracies between CLIP, CoCa using image embeddings only, and CoCa using caption embeddings only. We observe that while caption embeddings generally underperform compared to standard CoCa, they still retain competitive performance. We include more details on datasets and experiments in the Datasets subsection of the Experiments section.
Figure 4: An overview of our Image-Caption Encoding (ICE) method. Here, we query a captioner and obtain the caption embedding using the text encoder. We calculate the image and caption probability distributions over the classes by passing the image embeddings, caption embeddings, and class embeddings through the cosine similarity function $\theta$ and softmax operation. Then, we select the Top-$K$ classes and perform a weighted sum of the image and caption probabilities. The weight on the caption prediction $\lambda$ is adaptively selected based on the relative confidence of the image and caption predictions.
Figure 5: A more detailed look at how our Image-Caption Encoding (ICE) method works. In practice, instead of using a single caption for ICE, we use the centroid of $\upsilon$ differently-prompted caption embeddings. Then, using the centroid caption embedding, we adaptively select the $\lambda$ weight by comparing the standard deviations of the image prediction probabilities and caption prediction probabilities, over the Top-$5$ classes. The final ICE scores are then a $\lambda$-weighted sum between the two probability distributions.
...and 2 more figures
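
The captions of Figures 4 and 5 describe how the image and caption probability distributions are obtained: embeddings are compared with the class embeddings by cosine similarity, passed through a softmax, and the caption side uses the centroid of several differently-prompted caption embeddings. The following NumPy sketch illustrates those steps under stated assumptions: the embeddings come from some pre-trained encoder (not shown), each caption embedding is L2-normalized before averaging, and `scale` is a hypothetical CLIP-style logit scale. The helper names are illustrative, not the authors' API.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length (assumes no all-zero vectors)."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def caption_centroid(caption_embeddings):
    """Centroid of several caption embeddings, shape (num_captions, D) -> (D,).

    Assumption: each embedding is normalized before averaging.
    """
    return l2_normalize(caption_embeddings).mean(axis=0)

def class_probs(embedding, class_embeddings, scale=100.0):
    """Softmax over scaled cosine similarities to every class embedding.

    embedding:        (D,) image or caption embedding.
    class_embeddings: (C, D) text embeddings of the class names.
    scale:            illustrative logit scale, not a value from the paper.
    """
    sims = l2_normalize(class_embeddings) @ l2_normalize(embedding)
    logits = scale * sims
    logits = logits - logits.max()   # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Toy example with 3 classes in a 3-dimensional embedding space.
classes = np.eye(3)
captions = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]   # two prompted captions
centroid = caption_centroid(captions)
print(class_probs(centroid, classes))
```

The same `class_probs` helper applies to the image embedding; the two resulting distributions are the inputs to the $\lambda$-weighted sum over the Top-$K$ classes.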
