CLID: Controlled-Length Image Descriptions with Limited Data

Elad Hirsch, Ayellet Tal

TL;DR

The paper tackles length-controlled image captioning in the presence of scarce long-caption data. It introduces a two-phase framework: first, self-generate a large, varying-length caption dataset from scene graphs using saliency-guided traversal; second, train with a data-selection strategy that blends a small, high-quality trusted corpus with a large, noisy extended corpus, gradually filtering low-quality samples while retaining long-caption information. A quality-score-based sampling scheme with a smooth threshold controls exposure to synthetic data across training iterations, enabling robust length control without sacrificing overall caption quality. Experiments on MS-COCO and related data show substantial improvements in length-control precision, competitive SPICE scores, and human preference for CLID captions, with strong performance also in paragraph generation. The approach is general and applies to longer-form image descriptions, offering practical benefits for varied user needs and applications.
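The idea of a smooth, quality-score-based threshold can be illustrated with a minimal sketch. The summary above does not give the paper's actual formulas, so everything here is an assumption made for illustration: the linear threshold schedule, the sigmoid gate, the `temperature` parameter, and the function names `threshold_at`, `keep_probability`, and `sample_batch` are all hypothetical. The sketch only shows the general pattern, in which low-scoring (mostly self-generated) captions are sampled often early in training and rarely later on.

```python
import math
import random


def threshold_at(step, total_steps, start=-5.0, end=0.0):
    """Quality threshold at a given step (assumed linear schedule, not from the paper)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def keep_probability(score, threshold, temperature=1.0):
    """Smooth gate (assumed sigmoid): near 1 well above the threshold, near 0 well below."""
    return 1.0 / (1.0 + math.exp(-(score - threshold) / temperature))


def sample_batch(scores, step, total_steps, batch_size, rng):
    """Draw sample indices with probability proportional to the smooth gate."""
    tau = threshold_at(step, total_steps)
    weights = [keep_probability(s, tau) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=batch_size)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical quality scores: positive for trusted, negative for self-generated.
    scores = [2.0, 1.5, -1.0, -4.0]
    for step in (0, 50, 100):
        batch = sample_batch(scores, step, total_steps=100, batch_size=8, rng=rng)
        print(step, batch)
```

With a gate of this kind, no sample is ever cut off abruptly: a low-scoring caption simply becomes less and less likely to be drawn as the threshold rises, which is one plausible reading of "gradually filtering low-quality samples".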

Abstract

Controllable image captioning models generate human-like image descriptions while offering some control over the generated captions. This paper focuses on controlling the caption length, i.e., producing either a short and concise description or a long and detailed one. Since existing image captioning datasets contain mostly short captions, generating long captions is challenging. To address the shortage of long training examples, we propose to enrich the dataset with varying-length self-generated captions. These, however, might be of varying quality and are thus unsuitable for conventional training. We introduce a novel training strategy that selects the data points to be used at different times during training. Our method dramatically improves length-control abilities while exhibiting SoTA performance in terms of caption quality. Our approach is general and is shown to be applicable to paragraph generation as well.

Paper Structure (8 sections, 3 equations, 7 figures, 3 tables)


Figures (7)

  • Figure 1: Length-controlled image captioning. People describe a given image briefly or at length. Most previous works generate short captions, which are prevalent in existing datasets. We propose a method that generates captions of sought-after lengths. Our method generates long captions, which hardly exist in training datasets, and achieves results comparable to SoTA methods for short captions.
  • Figure 2: Outline. (I) To overcome the shortage of long captions in trusted datasets (green), a new dataset is self-generated (red) using scene graphs, creating an extended dataset. (II) During training, the low-quality data is gradually filtered out, while the information learned from it is retained. This improves length control and preserves captioning quality.
  • Figure 3: Length of captioning datasets. The average caption length in the trusted (MS-COCO) dataset is $11.95$ tokens with a standard deviation of $2.58$, whereas in our extended dataset these are $21.3$ and $13.56$, respectively. (The third color marks where the two histograms overlap.)
  • Figure 4: Data separation. Quality scores, computed by Eq. \ref{eq:noise}, manage to separate the trusted data from the self-generated data. The trusted data points have mostly ($90\%$) positive scores (dashed blue line), while the self-generated data points have mostly ($99\%$) negative scores (dashed orange line).
  • Figure 5: Captioning performance. In terms of the SPICE quality measure (vertical axis), our results (orange star) are similar to those of Deng et al. (2020) (green circle), whose model is trained on the trusted dataset. The quality of other solutions (gray/purple) is dramatically degraded. While comparable to Deng et al. (2020) quality-wise, our model improves the control precision (horizontal axis). In both measures, higher is better. The figure shows $3$ length levels; the other levels appear in the supplements.
  • ...and 2 more figures
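
The summary statistics quoted in the captions of Figures 3 and 4 (mean and standard deviation of caption length, and the share of positive quality scores per corpus) are simple to recompute on one's own data. The following is a minimal sketch under stated assumptions: captions are split on whitespace, which may differ from the tokenizer the authors used; the population standard deviation is used, which may also differ from theirs; and the helper names `length_stats` and `positive_fraction` are hypothetical.

```python
from statistics import mean, pstdev


def length_stats(captions):
    """Mean and population standard deviation of caption length in whitespace tokens."""
    lengths = [len(caption.split()) for caption in captions]
    return mean(lengths), pstdev(lengths)


def positive_fraction(scores):
    """Share of samples whose quality score is strictly positive."""
    return sum(score > 0 for score in scores) / len(scores)


if __name__ == "__main__":
    # Toy inputs for illustration only; these are not data from the paper.
    captions = ["a dog runs", "a dog runs across the field"]
    scores = [1.0, -2.0, 3.0, 0.5]
    print(length_stats(captions))
    print(positive_fraction(scores))
```

Applied separately to a trusted corpus and a self-generated one, `positive_fraction` gives the kind of separation figure reported for Figure 4.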