Image-Caption Encoding for Improving Zero-Shot Generalization
Eric Yang Yu, Christopher Liao, Sathvik Ravi, Theodoros Tsiligkaridis, Brian Kulis
TL;DR
This work tackles zero-shot generalization in vision–language models by introducing Image-Caption Encoding (ICE), a training-free mechanism that refines image-based predictions with caption-derived signals at inference time. ICE constructs a caption embedding from multiple prompts, computes Top-$K$ image and caption probabilities, and fuses them with an adaptive weight $\lambda$ to steer the final decision within the Top-$K$ set, formalized as $S^I_{\omega} + \lambda S^c_{\omega}$ for the selected class $\omega$ and $\lambda = \xi \frac{\sigma(S^c_K)}{\max(\bigl\|[\sigma(S^I_K), \sigma(S^c_K)]\bigr\|_2, \epsilon)}$. ICE is shown to improve zero-shot accuracy by about 0.5% on average and up to 3% on several challenging datasets, while remaining compatible with multiple SOTA backbones (e.g., CoCa, BLIP-2, LLaVA). The method relies on caption properties that can provide complementary information not fully captured by image embeddings, and includes ablations on the number of captions, Top-$K$, and adaptive weighting to understand when ICE is most beneficial. The results indicate practical value for deploying ICE to bolster OOD generalization in real-world, label-scarce settings and demonstrate that captions can be a meaningful diversification signal in multimodal inference.
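The fusion rule can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes that per-class image and caption logits are already available (for example, similarities between the image or caption embedding and each class text embedding), that $S^I_K$ and $S^c_K$ are softmax probabilities restricted to the Top-$K$ classes ranked by the image prediction, and that $\sigma(\cdot)$ denotes the standard deviation of those $K$ probabilities. The function name `ice_predict` and the default values of `k`, `xi`, and `eps` are hypothetical choices made for illustration.

```python
import math
from statistics import pstdev


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ice_predict(image_logits, caption_logits, k=5, xi=0.1, eps=1e-8):
    """Return (predicted class index, lambda) for an ICE-style fusion.

    Assumptions (not confirmed by the text above): probabilities are a
    softmax over the Top-K logits only, and sigma is the population
    standard deviation of those K probabilities.
    """
    # Top-K candidate classes, ranked by the image prediction.
    top_k = sorted(range(len(image_logits)),
                   key=lambda i: image_logits[i], reverse=True)[:k]

    # Image and caption probabilities restricted to the Top-K set.
    s_img = softmax([image_logits[i] for i in top_k])
    s_cap = softmax([caption_logits[i] for i in top_k])

    # Adaptive weight: lambda = xi * sd_cap / max(||[sd_img, sd_cap]||_2, eps).
    sd_img, sd_cap = pstdev(s_img), pstdev(s_cap)
    lam = xi * sd_cap / max(math.hypot(sd_img, sd_cap), eps)

    # Fused score S^I + lambda * S^c, maximized over the Top-K set.
    fused = [si + lam * sc for si, sc in zip(s_img, s_cap)]
    best = max(range(len(top_k)), key=lambda j: fused[j])
    return top_k[best], lam


# The image narrowly prefers class 0; a confident caption moves the choice to class 1.
pred, lam = ice_predict([2.0, 1.95, 0.0, -1.0], [0.0, 3.0, 0.0, 0.0], k=3, xi=0.5)
print(pred)  # -> 1
```

Because $\sigma(S^c_K)$ never exceeds the norm in the denominator, $\lambda$ is at most $\xi$ in this sketch, and it drops to zero when the caption probabilities are uniform over the Top-$K$ set, in which case the image prediction is left unchanged.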
Abstract
Recent advances in vision-language models have combined contrastive approaches with generative methods to achieve state-of-the-art (SOTA) performance on downstream inference tasks like zero-shot image classification. However, a persistent weakness of these models in image classification is their limited out-of-distribution (OOD) generalization. We first show that when an OOD data point is misclassified, the correct class can typically be found in the Top-K predicted classes. To steer the model prediction toward the correct class within the top predicted classes, we propose the Image-Caption Encoding (ICE) method, a straightforward approach that directly enforces consistency between the image-conditioned and caption-conditioned predictions at evaluation time only. Intuitively, we take advantage of unique properties of the generated captions to guide our local search for the correct class label within the Top-K predicted classes. We show that our method can be easily combined with other SOTA methods to enhance Top-1 OOD accuracies by 0.5% on average and up to 3% on challenging datasets. Our code: https://github.com/Chris210634/ice
