Exploring Diverse In-Context Configurations for Image Captioning

Xu Yang; Yongliang Wu; Mingzhuo Yang; Haokun Chen; Xin Geng

TL;DR

This study systematically analyzes how multi-modal in-context configurations affect few-shot image captioning with Vision-Language Models. By varying image-selection (RS, SIIR, SICR-CLIP, DIIR variants) and caption-assignment (GTC, MGC variants, IP, MGCA) strategies on MSCOCO using Open-Flamingo and Otter backbones, the authors reveal two key insights: (1) caption descriptiveness and language patterns impact VL in-context learning differently depending on image context, and (2) excessive similarity between in-context and test images can induce shortcut inferences. The work reports an average gain of 20.9 CIDEr points over the baseline for the best strategy combinations, and provides practical guidelines and iterative prompting methods for cases with limited or no ground-truth captions. These findings emphasize the importance of multi-modal synergy in in-context learning and offer strategies that generalize across VL backbones, informing future design of VL prompting systems. The study also acknowledges limitations tied to the open-source Open-Flamingo baseline and suggests evaluating with stronger multi-modal models to validate and extend the conclusions.

Abstract

Since the discovery that Language Models (LMs) can be good in-context few-shot learners, numerous strategies have been proposed to optimize in-context sequence configurations. Recently, researchers in Vision-Language (VL) domains have also developed few-shot learners, but they use only the simplest approach, i.e., random sampling, to configure in-context image-text pairs. To explore the effects of varying configurations on VL in-context learning, we devised four strategies for image selection and four for caption assignment to configure in-context image-text pairs for image captioning. Image captioning is used as the case study because it can be seen as a visually conditioned LM. Our comprehensive experiments yield two counter-intuitive but valuable insights, highlighting the distinct characteristics of VL in-context learning due to multi-modal synergy, as compared to the NLP case. Furthermore, in our exploration of optimal combination strategies, we observed an average improvement of 20.9 CIDEr points over the baseline. The code is available at https://github.com/yongliang-wu/ExploreCfg.
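
To make the contrast between random sampling and similarity-based image selection concrete, the following is a minimal sketch and not the authors' implementation. It assumes each image already has an embedding vector from some image encoder (for example a CLIP-style model); the function names `select_random`, `select_similar`, and `build_sequence`, the toy vectors, and the captions are all hypothetical and chosen only for illustration.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_random(pool_ids: Sequence[str], k: int, seed: int = 0) -> List[str]:
    """Baseline: draw k in-context images uniformly at random from the pool."""
    rng = random.Random(seed)
    return rng.sample(list(pool_ids), k)


def select_similar(test_vec: Vector, pool: Dict[str, Vector], k: int) -> List[str]:
    """Similarity-based retrieval: the k pool images closest to the test image."""
    ranked = sorted(pool, key=lambda i: cosine(test_vec, pool[i]), reverse=True)
    return ranked[:k]


def build_sequence(
    shot_ids: Sequence[str], captions: Dict[str, str], test_id: str
) -> List[Tuple[str, str]]:
    """Pair each selected image with its assigned caption, then append the
    test image with an empty caption for the model to complete."""
    pairs = [(i, captions[i]) for i in shot_ids]
    pairs.append((test_id, ""))
    return pairs


# Toy data (hypothetical): two-dimensional stand-ins for image embeddings.
pool = {"img_a": [1.0, 0.0], "img_b": [0.9, 0.1], "img_c": [0.0, 1.0]}
captions = {
    "img_a": "a dog on grass",
    "img_b": "a dog in a park",
    "img_c": "a plate of food",
}
test_vec = [1.0, 0.05]

shots = select_similar(test_vec, pool, k=2)
sequence = build_sequence(shots, captions, "test_img")
```

Swapping `select_similar` for `select_random` changes only which images enter the sequence; how captions are assigned to those images is a separate choice, which is why the two can be varied independently.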

Paper Structure (17 sections, 1 equation, 11 figures, 8 tables)

Introduction
Related Work
Configuring In-Context Sequences
Selecting Images
Assigning Captions
Experiments
Dataset and Implementation Details
Results and Analyses
Effects of Caption Qualities
Effects of Image Qualities
Conclusion and Limitations
Experimental results on Open-Flamingo v1 9B
Experimental results on Open-Flamingo v2 3B
Experimental results on Otter
More Results of MGC-TF@135 vs. GTC
...and 2 more sections

Figures (11)

Figure 1: The distinction between LMs and VLMs as few-shot learners. LMs generally excel with examples akin to the test case (blue blocks in (a)). In contrast, for VLMs, performance is not strictly correlated with image similarity but relies heavily on caption quality. For instance, when low-quality captions are used, similar images (d) lead to worse performance than dissimilar ones (f), since VLMs may build a shortcut by reusing in-context captions without seeing the given images.
Figure 2: Image selection strategies: (a) SIIR-CLIP, (b) SIIR-TAG, (c) DIIR-TT, (d) SICR-CLIP.
Figure 3: The line charts of various in-context captions with diverse image-selection strategies.
Figure 4: The line charts of various in-context images with diverse caption-assignment strategies.
Figure 5: The histograms of various in-context captions with diverse image-selection strategies.
...and 6 more figures
