A Chain-of-Thought Subspace Meta-Learning for Few-shot Image Captioning with Large Vision and Language Models
Hao Huang, Shuaihang Yuan, Yu Hao, Congcong Wen, Yi Fang
TL;DR
The paper tackles the challenge of few-shot image captioning when using frozen large vision and language models by introducing a chain-of-thought (CoT) meta-learning framework with subspace parameterization. It decomposes caption generation into a three-step reasoning process (subject, object, and caption) and learns step-specific meta-parameters in distinct subspaces to minimize interference, enabling effective adaptation with limited data. Empirical results on MSCOCO, Flickr8k, and Flickr30k show that CoT subspace meta-learning outperforms baselines across multiple metrics, demonstrating improved caption quality and grounding in both in-domain and cross-domain settings. The approach offers a principled way to leverage language priors and visual features jointly, with practical impact for scalable multimodal adaptation using frozen foundation models.
Abstract
A large-scale vision and language model pretrained on massive data encodes visual and linguistic priors, which makes it easier to generate images and language that are natural and realistic. Despite this, a significant domain gap remains between the vision and language modalities, especially in few-shot settings, where only very limited data are available for training. To mitigate this issue, a multi-modal meta-learning framework has been proposed that bridges two frozen pretrained large vision and language models by introducing a tunable prompt connecting them. For few-shot image captioning, the existing multi-modal meta-learning framework uses a one-step prompting scheme to accumulate the visual features of input images to guide the language model, which struggles to generate accurate image descriptions from only a few training samples. Instead, we propose a chain-of-thought (CoT) meta-learning scheme as a multi-step image captioning procedure that better imitates how humans describe images. In addition, we propose to learn the meta-parameters of the model corresponding to each CoT step in distinct subspaces to avoid interference. We evaluate our method on three commonly used image captioning datasets, MSCOCO, Flickr8k, and Flickr30k, under few-shot settings. The experimental results indicate that our chain-of-thought subspace meta-learning strategy outperforms the baselines across datasets and evaluation metrics.
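The idea of keeping step-specific meta-parameters in distinct subspaces can be illustrated with a small NumPy sketch. It is not the paper's implementation: the abstract does not specify how the subspaces are built, so the sketch assumes mutually orthogonal low-rank bases, a shared parameter vector, and the three step names from the TL;DR (subject, object, caption). The function names, dimensions, and ranks are all hypothetical.

```python
import numpy as np

# Step names taken from the TL;DR; everything else here is an illustrative assumption.
STEPS = ("subject", "object", "caption")


def make_orthogonal_subspaces(dim, num_steps, rank, seed=0):
    """Return one (dim x rank) orthonormal basis per step, mutually orthogonal.

    Assumption: subspaces are carved out of a single random orthonormal basis.
    The paper may construct or learn its subspaces differently.
    """
    if num_steps * rank > dim:
        raise ValueError("num_steps * rank must not exceed dim")
    rng = np.random.default_rng(seed)
    # Reduced QR gives dim x (num_steps * rank) orthonormal columns.
    q, _ = np.linalg.qr(rng.standard_normal((dim, num_steps * rank)))
    return [q[:, k * rank:(k + 1) * rank] for k in range(num_steps)]


def step_parameters(shared, bases, coeffs):
    """Compose per-step parameters as shared + basis @ coefficients."""
    return [shared + basis @ z for basis, z in zip(bases, coeffs)]


dim, rank = 16, 4
bases = make_orthogonal_subspaces(dim, len(STEPS), rank)
shared = np.zeros(dim)  # hypothetical shared meta-parameter vector
rng = np.random.default_rng(1)
coeffs = [rng.standard_normal(rank) for _ in STEPS]  # low-dimensional, per step
params = dict(zip(STEPS, step_parameters(shared, bases, coeffs)))

# Offsets that live in different subspaces have zero inner product,
# so an update to one step's coefficients cannot move another step's offset.
subject_offset = params["subject"] - shared
object_offset = params["object"] - shared
overlap = float(subject_offset @ object_offset)
```

Under these assumptions, `overlap` is zero up to floating-point error, which is one simple way to read "avoiding interference" between steps. Whether the paper enforces orthogonality or only separates the parameter sets is not stated in the abstract.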
