
Text-only Synthesis for Image Captioning

Qing Zhou, Junlin Huang, Qiang Li, Junyu Gao, Qi Wang

TL;DR

ToCa addresses the high cost of paired image-text annotations by proposing text-only synthesis for image captioning. It decomposes captions into lexical pairs and structure templates, and uses an open-source LLM to recombine them into diverse captions, enabling zero-shot cross-domain generalization and data-efficient training. The scheme defines three synthesis settings—in-domain, cross-domain, and data-efficient—and shows substantial CIDEr gains across COCO, Flickr30k, and NoCaps benchmarks with reduced human labor and computing time. The work offers a practical, accessible path for synthetic-text augmentation in captioning and suggests broader applicability to other text-generation tasks.

Abstract

From paired image-text training to text-only training for image captioning, the goal has consistently been to relax the requirement for costly, large-scale annotation of high-quality data. In this paper, we propose Text-only Synthesis for Image Captioning (ToCa), which further advances this relaxation with less human labor and less computing time. Specifically, we deconstruct caption text into structures and lexical words, which serve as the fundamental components of a caption. By combining different structures and lexical words as inputs to a large language model, a large number of captions containing various patterns of lexical words are generated. This method not only approximates the target domain but also goes beyond it by generating new captions, thereby enhancing the zero-shot generalization ability of the model. Considering the different levels of data access in the real world, we define three synthesis scenarios: cross-domain synthesis, in-domain synthesis, and data-efficient synthesis. Experiments in these scenarios demonstrate the generalizability, transferability, and practicability of ToCa, with an improvement of nearly 5 CIDEr for zero-shot cross-domain captioning and a maximum increase of over 20 CIDEr for data-efficient captioning.
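The recombination step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The template strings, the lexical pairs, the prompt wording, and the function names `build_prompt` and `sample_prompts` are all hypothetical, chosen only to show how structures and lexical words might be paired up as LLM inputs. The LLM call itself is left out.

```python
import random

# Hypothetical structure templates and lexical pairs. In practice these
# would be extracted from a caption corpus; the extraction step is not shown.
STRUCTURE_TEMPLATES = [
    "a NOUN VERB on a NOUN",
    "two NOUN VERB near the NOUN",
]
LEXICAL_PAIRS = [
    ("dog", "frisbee"),
    ("man", "surfboard"),
    ("cat", "sofa"),
]


def build_prompt(template, pairs):
    """Format one structure template and some lexical pairs as an LLM prompt.

    The prompt wording is an assumption, not the prompt used in the paper.
    """
    words = ", ".join(f"({a}, {b})" for a, b in pairs)
    return (
        "Write one image caption.\n"
        f"Follow this sentence structure: {template}\n"
        f"Use these word pairs: {words}\n"
        "Caption:"
    )


def sample_prompts(templates, pairs, n_prompts, pairs_per_prompt=1, seed=0):
    """Randomly combine templates with lexical pairs to get varied prompts."""
    rng = random.Random(seed)  # seeded so the same prompts can be regenerated
    prompts = []
    for _ in range(n_prompts):
        template = rng.choice(templates)
        chosen = rng.sample(pairs, k=pairs_per_prompt)
        prompts.append(build_prompt(template, chosen))
    return prompts


if __name__ == "__main__":
    for prompt in sample_prompts(STRUCTURE_TEMPLATES, LEXICAL_PAIRS, n_prompts=2):
        print(prompt, end="\n\n")
```

Each resulting prompt would be sent to a language model, and the completions collected as synthetic captions. Because templates and pairs are sampled independently, the number of distinct prompts grows with the product of the two pools, which is one plausible reading of how a small text corpus can yield many new captions.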

Paper Structure (17 sections, 11 equations, 3 figures, 12 tables)


Figures (3)

  • Figure 1: Effect of the number of synthesized texts and of different values of $\tau$. $\mathcal{D+T}$ represents training on synthesized data $\mathcal{D}$ and then fine-tuning on accessible target data $\mathcal{T}$. $\mathcal{D}$ represents training solely on synthesized data. $\Delta$ denotes the difference in CIDEr scores between $\mathcal{D+T}$ and $\mathcal{D}$.
  • Figure 2: Visualization of the process and results of synthesizing and captioning.
  • Figure 3: t-SNE visualizations of Flickr30k, COCO, and ToCa. Panels (a)-(c) show how the feature distributions of Flickr30k and COCO relate, and panels (d)-(f) show the same for ToCa and COCO, as the relative quantity increases from $0.1\times$ to $1\times$ to $10\times$.