Text-only Synthesis for Image Captioning
Qing Zhou, Junlin Huang, Qiang Li, Junyu Gao, Qi Wang
TL;DR
ToCa addresses the high cost of paired image-text annotations by proposing text-only synthesis for image captioning. It decomposes captions into lexical pairs and structure templates, and uses an open-source LLM to recombine them into diverse captions, enabling zero-shot cross-domain generalization and data-efficient training. The scheme defines three synthesis settings—in-domain, cross-domain, and data-efficient—and shows substantial CIDEr gains across COCO, Flickr30k, and NoCaps benchmarks with reduced human labor and computing time. The work offers a practical, accessible path for synthetic-text augmentation in captioning and suggests broader applicability to other text-generation tasks.
Abstract
From paired image-text training to text-only training, image captioning research has consistently sought to relax the need for costly, large-scale annotation of high-quality data. In this paper, we propose Text-only Synthesis for Image Captioning (ToCa), which advances this relaxation further with less human labor and less computing time. Specifically, we deconstruct caption text into structures and lexical words, which serve as the fundamental components of a caption. By feeding different combinations of structures and lexical words to a large language model, we generate a large number of captions containing varied patterns of lexical words. The synthesized text not only approximates the target domain but also extends beyond it with new captions, thereby enhancing the zero-shot generalization ability of the model. Considering the different levels of data access in the real world, we define three synthesis scenarios: cross-domain synthesis, in-domain synthesis, and data-efficient synthesis. Experiments in these scenarios demonstrate the generalizability, transferability, and practicability of ToCa, with an improvement of nearly 5 CIDEr for zero-shot cross-domain captioning and a maximum increase of over 20 CIDEr for data-efficient captioning.
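The deconstruct-and-recombine idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function-word list, the `[W]` placeholder, the prompt wording, and the helper names (`decompose`, `build_prompt`, `synthesize_prompts`) are all assumptions made for illustration, and the call to the language model is left out entirely.

```python
import itertools
import random

# Toy function-word list (an assumption for this sketch; a real pipeline
# would more likely use a POS tagger to separate structure from content).
FUNCTION_WORDS = {"a", "an", "the", "on", "in", "at", "of", "with",
                  "is", "are", "and", "to", "next"}


def decompose(caption):
    """Split a caption into a structure template and its lexical pairs."""
    tokens = caption.lower().rstrip(".").split()
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    # Structure template: keep function words, mask content words.
    template = " ".join(t if t in FUNCTION_WORDS else "[W]" for t in tokens)
    # Lexical pairs: content words that co-occur in the same caption.
    pairs = list(itertools.combinations(content, 2))
    return template, pairs


def build_prompt(template, pairs):
    """Format one LLM prompt (hypothetical wording, not from the paper)."""
    words = ", ".join(f"({a}, {b})" for a, b in pairs)
    return (f"Write one image caption that follows the structure "
            f"'{template}' and uses the word pairs: {words}.")


def synthesize_prompts(captions, n, seed=0):
    """Recombine templates and lexical pairs from a corpus into n prompts."""
    rng = random.Random(seed)
    templates, all_pairs = [], []
    for caption in captions:
        template, pairs = decompose(caption)
        templates.append(template)
        all_pairs.extend(pairs)
    prompts = []
    for _ in range(n):
        template = rng.choice(templates)
        pairs = rng.sample(all_pairs, k=min(2, len(all_pairs)))
        prompts.append(build_prompt(template, pairs))
    return prompts


if __name__ == "__main__":
    corpus = ["A dog sits on the grass.", "A man rides a bike in the city."]
    for prompt in synthesize_prompts(corpus, n=3):
        print(prompt)
```

Each prompt would then be sent to a language model, and the returned captions would serve as text-only training data. Because templates and pairs are sampled independently, a structure from one caption can be paired with words from another, which is the source of the new lexical patterns.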
