Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights

Shunqi Mao; Chaoyi Zhang; Hang Su; Hwanjun Song; Igor Shalyminov; Weidong Cai

TL;DR

Two approaches are presented, Prompting-based Controller (P-Ctrl) and Recalibration-based Controller (R-Ctrl), to generate focused captions, and a GPT-4V empowered evaluator is designed to assess the quality of the controlled captions alongside standard assessment methods.

Abstract

Contextualized Image Captioning (CIC) extends traditional image captioning into a more complex domain that requires multimodal reasoning: it aims to generate image captions given specific contextual information. This paper further introduces the novel task of Controllable Contextualized Image Captioning (Ctrl-CIC). Unlike CIC, which relies solely on broad context, Ctrl-CIC emphasizes a user-defined highlight, compelling the model to tailor captions to the highlighted aspects of the context. We present two approaches, Prompting-based Controller (P-Ctrl) and Recalibration-based Controller (R-Ctrl), to generate focused captions. P-Ctrl conditions generation on the highlight by prepending captions with highlight-driven prefixes, whereas R-Ctrl tunes the model to selectively recalibrate the encoder embeddings for highlighted tokens. Additionally, we design a GPT-4V empowered evaluator to assess the quality of the controlled captions alongside standard assessment methods. Extensive experimental results demonstrate the efficient and effective controllability of our method, charting a new direction in achieving user-adaptive image captioning. Code is available at https://github.com/ShunqiM/Ctrl-CIC.
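The two controllers described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); the function names, the `SEP` separator, and the per-token weights below are hypothetical and chosen only to show the two ideas: a highlight-driven prefix prepended to the caption, and token embeddings rescaled according to highlight relevance.

```python
from typing import List, Sequence

# Hypothetical separator between the highlight prefix and the caption.
SEP = " => "


def build_prefixed_target(highlights: Sequence[str], caption: str) -> str:
    """P-Ctrl-style idea: prepend a highlight-driven prefix to the caption.

    Assumption: the prefix is simply the highlighted spans joined by "; ".
    """
    prefix = "; ".join(h.strip() for h in highlights if h.strip())
    return f"{prefix}{SEP}{caption}" if prefix else caption


def recalibrate(
    embeddings: List[List[float]], weights: Sequence[float]
) -> List[List[float]]:
    """R-Ctrl-style idea: rescale each token embedding by a per-token weight.

    Assumption: a larger weight marks a token as more relevant to the
    highlight; how the weights are obtained is outside this sketch.
    """
    if len(embeddings) != len(weights):
        raise ValueError("need exactly one weight per token embedding")
    return [[w * x for x in vec] for vec, w in zip(embeddings, weights)]


if __name__ == "__main__":
    print(build_prefixed_target(["water content"], "A succulent leaf."))
    print(recalibrate([[1.0, 2.0], [3.0, 4.0]], [1.0, 0.5]))
```

In this sketch the prefix acts on the decoder side (the model learns to continue from the prefix), while the recalibration acts on the encoder side; both are simplified stand-ins for the mechanisms the paper describes.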

Paper Structure

This paper contains 45 sections, 7 equations, 9 figures, and 14 tables.

Introduction
Related Work
Contextualized Image Captioning.
Vision-Language Model for Image Understanding.
Controllable Text Generation.
Method
Revisiting Contextualized Image Captioning
Controllable Contextualized Image Captioning
Controlled Caption Generation
Prompting-Based Controller
Recalibration-Based Controller
Adaptability to CIC
GPT-4V based Evaluation
Highlights Selection
Evaluation
...and 30 more sections

Figures (9)

Figure 1: We introduce the Controllable Contextualized Image Captioning (Ctrl-CIC) task: Given a global context, Ctrl-CIC aims at generating contextualized image captions tailored to specific highlighted segments. In the presented context regarding "succulents", highlights direct the caption's emphasis, underscoring distinct attributes such as its anatomical structure or water content.
Figure 2: Overview of the proposed Ctrl-CIC method. (a) We derive the token-level relevance scores that indicate the probability of the token being part of the highlights for the context-caption pair. (b) Overview of the training pipeline of the Prompting-based and Recalibration-based Controllers. For Ctrl-CIC inference, the model is guided by either new prompts or recalibrated weights based on highlights, to produce controlled captions.
Figure 3: GPT-4V empowered evaluator for Ctrl-CIC task. Given a pair of Ctrl-CIC and reference captions, this GPT-4V evaluator comprehensively reasons and marks them, and derives the final score as the ratio of raw marks between Ctrl-CIC and the reference caption. Note: the reference caption also serves as GT for the standard CIC task.
Figure 4: Qualitative demonstration of our Ctrl-CIC results produced by $\mathcal{P}$-$\mathtt{Ctrl}$. Highlights and their respective Ctrl-CIC captions are aligned in colors, showing how captions vary with different input images and highlights for the same context. Section titles, if any, are appended after the page title. Paragraphs in the context without any highlights are omitted for readability.
Figure 5: UI of the scoring application for human evaluation. The top text box displays the context, image description, and the highlighted segments, along with the two captions to be scored based on Context Relevance, Highlight Relevance, Image Consistency, and Overall Quality. Human markers review the context and score both the reference and Ctrl-CIC captions on each metric.
...and 4 more figures
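The scoring rule in the Figure 3 caption (final score as the ratio of raw marks between the Ctrl-CIC caption and the reference caption) can be sketched as follows. The function name and the example marks are hypothetical; the metric names are borrowed from the Figure 5 caption purely for illustration, and the marking scale used by the evaluator is not stated here.

```python
from typing import Dict


def relative_scores(
    ctrl_marks: Dict[str, float], ref_marks: Dict[str, float]
) -> Dict[str, float]:
    """Return, per metric, the Ctrl-CIC mark divided by the reference mark.

    Assumption: both dictionaries share the same metric names and every
    reference mark is strictly positive.
    """
    scores = {}
    for metric, ref in ref_marks.items():
        if ref <= 0:
            raise ValueError(f"reference mark for {metric!r} must be positive")
        scores[metric] = ctrl_marks[metric] / ref
    return scores


if __name__ == "__main__":
    # Made-up raw marks, for illustration only.
    ctrl = {"Highlight Relevance": 3.0, "Image Consistency": 4.0}
    ref = {"Highlight Relevance": 4.0, "Image Consistency": 4.0}
    print(relative_scores(ctrl, ref))
```

A ratio above 1 would mean the controlled caption received a higher raw mark than the reference on that metric, and a ratio below 1 the opposite.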
