CIC-BART-SSA: Controllable Image Captioning with Structured Semantic Augmentation

Kalliopi Basioti; Mohamed A. Abdelsalam; Federico Fancellu; Vladimir Pavlovic; Afsaneh Fazly

TL;DR

This work tackles the lack of diverse, controllable signals in standard image captioning datasets by introducing Structured Semantic Augmentation (SSA), an automatic AMR-based augmentation framework. SSA constructs a meta-vgAMR graph for each image and samples semantically coherent, visually grounded subgraphs from it to generate focused captions and their corresponding controls. Building on SSA, the authors propose CIC-BART-SSA, a transformer-based CIC model that accepts simple controls, namely image regions and caption length plus optional verb guidance, without requiring rich, hand-crafted control signals. SSA augments training data from the MS-COCO and Flickr datasets, expanding the coverage of highly focused captions and longer sequences. Compared with prior CIC models, CIC-BART-SSA achieves superior diversity and text quality while remaining competitive in controllability, as measured by a suite of metrics that includes a harmonic overall score. The work shows that AMR-based augmentation can reduce dependence on expensive, richly annotated controls and improve practical CIC performance; code is available for replication and further exploration.

Abstract

Controllable Image Captioning (CIC) aims at generating natural language descriptions for an image, conditioned on information provided by end users, e.g., regions, entities or events of interest. However, available image-language datasets mainly contain captions that describe the entirety of an image, making them ineffective for training CIC models that can potentially attend to any subset of regions or relationships. To tackle this challenge, we propose a novel, fully automatic method to sample additional focused and visually grounded captions using a unified structured semantic representation built on top of the existing set of captions associated with an image. We leverage Abstract Meaning Representation (AMR), a cross-lingual graph-based semantic formalism, to encode all possible spatio-semantic relations between entities, beyond the typical spatial-relations-only focus of current methods. We use this Structured Semantic Augmentation (SSA) framework to augment existing image-caption datasets with the grounded controlled captions, increasing their spatial and semantic diversity and focal coverage. We then develop a new model, CIC-BART-SSA, specifically tailored for the CIC task, that sources its control signals from SSA-diversified datasets. We empirically show that, compared to SOTA CIC models, CIC-BART-SSA generates captions that are superior in diversity and text quality, are competitive in controllability, and, importantly, minimize the gap between broad and highly focused controlled captioning performance by efficiently generalizing to the challenging highly focused scenarios. Code is available at https://github.com/SamsungLabs/CIC-BART-SSA.
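The augmentation idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: real AMR graphs come from a parser and the paper's event-based sampling is more involved. Here each caption is assumed to be already reduced to hypothetical (event, role, entity) triples, the merged graph stands in for the meta-vgAMR, and the function names `merge_caption_graphs`, `sample_event_subgraphs`, and `control_signal` are invented for illustration.

```python
from collections import defaultdict

# Toy stand-ins for per-caption AMR graphs: (event, role, entity) triples.
# The captions and triples are hypothetical; they only echo the
# boat / dock / house example used in the paper's Figure 2.
CAPTION_GRAPHS = [
    [("tie-01", "ARG1", "boat"), ("tie-01", "ARG2", "dock")],
    [("stand-01", "ARG1", "house"), ("stand-01", "location", "dock")],
    [("tie-01", "ARG1", "boat")],  # repeats a relation already seen
]


def merge_caption_graphs(caption_graphs):
    """Union all caption graphs into one image-level graph.

    Sets deduplicate relations that several captions share.
    """
    meta = defaultdict(set)
    for triples in caption_graphs:
        for event, role, entity in triples:
            meta[event].add((role, entity))
    return meta


def sample_event_subgraphs(meta, events):
    """Return, for each requested event, the event with its direct arguments."""
    return {event: sorted(meta.get(event, ())) for event in events}


def control_signal(event, arguments):
    """Derive a simple control: the verb plus the entities to focus on."""
    verb = event.split("-")[0]  # drop the sense suffix, e.g. "tie-01" -> "tie"
    entities = sorted({entity for _, entity in arguments})
    return {"verb": verb, "entities": entities}


meta = merge_caption_graphs(CAPTION_GRAPHS)
subgraphs = sample_event_subgraphs(meta, ["tie-01", "stand-01"])
controls = {e: control_signal(e, args) for e, args in subgraphs.items()}
```

In this sketch each sampled subgraph would be passed to a graph-to-text generator to produce a focused caption, while the entity list stands in for the image regions used as the spatial control.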

Paper Structure (52 sections, 1 equation, 24 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Controllable Image Captioning (CIC).
Abstract Meaning Representation (AMR).
AMRs vs. Scene Graphs.
Model
Structured Semantic Augmentation (SSA)
Step 1: Image-level AMR graph generation.
Step 2: Event-based graph sampling from image-level AMRs.
Step 3: New caption generation from sampled AMRs.
Step 4: Control signal generation.
Mixing Strategies of Original and SSA Data
Random Sampling Strategy.
Uniform-Coverage Sampling Strategy.
Experimental Setup
...and 37 more sections

Figures (24)

Figure 1: Existing captioning datasets contain captions that describe the entirety of an image. This is reflected in the narrow distributions of the entities that appear in those captions and the caption lengths (the red-colored histograms). CIC aims to generate diverse descriptions by controllably re-focusing on different spatiosemantic aspects of an image, such as the semantically coherent subsets of image objects. Our proposed CIC-BART-SSA is designed to produce diverse, controlled captions ranging from brief and concise to detailed and comprehensive. Sentences 1-15 are example outputs of our approach where the highlighted text indicates the focus of a controllable caption. The histograms demonstrate that our approach generates high-quality descriptions for a wider range of scene focus (number of visual entities) and caption length compared to the original captions. http://cocodataset.org/#explore?id=108338 is licensed under https://creativecommons.org/licenses/by-sa/2.0/.
Figure 2: An example of our structured semantic augmentation approach. We start by using visually-grounded captions (1)-(5) to create a meta-vgAMR graph, which includes all available image information in one representation. We then sample sub-graphs from the meta-vgAMR to generate a new and diverse set of captions (such as sentences (a)-(e)). Our approach takes advantage of both linguistic and spatial diversity, with the latter creating descriptions for new combinations of visual entities. For instance, caption (a) focuses only on the 'boat', and captions (c) and (d) focus on the 'dock' and 'house', combinations that are not explored in the original captions. https://farm3.staticflickr.com/2129/2432734812_f1d31a8726_z.jpg is licensed under https://creativecommons.org/licenses/by-sa/2.0/.
Figure 3: The architecture diagram of our model, CIC-BART, which enables the generation of region- and length-controllable captions. When event information is available, the corresponding verb can be included in the control signal.
Figure 4: Content controllability (IoU) performance of CIC-BART when trained with (blue) and without SSA (green). The first column depicts the IoU on the original test set, and the second on the original test set images' SSA (only) data. The abscissa of each bar plot is the % of the image covered by the control signal, so the left and right parts of the graph represent more focused and broader control signals, respectively. The %Samples curve (orange) represents the distribution of test images in each coverage interval. The results show that SSA plays a crucial role in boosting CIC-BART performance in data-deprived, focused CIC settings.
Figure 5: Qualitative examples for the original test sets. Strikethrough marks hallucinations and redundancies. https://farm5.staticflickr.com/4102/4888234256_538b8dee56_z.jpg, https://farm9.staticflickr.com/8501/8308004994_44eb2d562d_z.jpg licensed under https://creativecommons.org/licenses/by-sa/2.0/; https://flickr.com/photo.gne?id=101362133, https://www.flickr.com/photos/moriza/151970521/ under https://creativecommons.org/licenses/by/2.0/.
...and 19 more figures
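The Figure 4 caption reports content controllability as IoU, grouped by how much of the image the control signal covers. A minimal sketch of that kind of analysis follows. It assumes IoU is the set overlap between the entities requested by the control and the entities mentioned in the generated caption; the exact definition used in the paper is not given on this page. The function names and the choice of four coverage bins are hypothetical.

```python
from collections import defaultdict


def entity_iou(control_entities, caption_entities):
    """Intersection over union of two entity sets.

    Returns 0.0 when both sets are empty (a convention chosen for this sketch).
    """
    control, caption = set(control_entities), set(caption_entities)
    union = control | caption
    return len(control & caption) / len(union) if union else 0.0


def mean_iou_by_coverage(samples, n_bins=4):
    """Average IoU within equal-width coverage bins.

    samples: iterable of (coverage, iou) pairs, with coverage in [0, 1].
    Returns {bin_index: mean_iou} for non-empty bins only.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for coverage, iou in samples:
        # Clamp so that coverage == 1.0 falls into the last bin.
        index = min(int(coverage * n_bins), n_bins - 1)
        totals[index] += iou
        counts[index] += 1
    return {index: totals[index] / counts[index] for index in sorted(counts)}


scores = mean_iou_by_coverage([(0.1, 0.5), (0.2, 1.0), (0.9, 1.0)])
```

Low bin indices correspond to highly focused controls and high indices to broad ones, which mirrors the left-to-right reading of the bar plots described in the caption.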
