Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning

Yingjin Song, Denis Paperno, Albert Gatt

TL;DR

The paper tackles visual storytelling from image sequences (VIST), aiming to produce coherent, informative narratives while handling visual variation. It presents a lightweight framework that freezes CLIP and GPT-2 and uses a context-aware visual-prefix mapping network, enhanced by curriculum learning and a multimodal contrastive objective to improve grounding and informativeness. Key contributions include two context-integration strategies for prefixes, a curriculum-learning regime, and contrastive training, along with extensive automatic and human evaluations demonstrating competitive performance and favorable grounding and informativeness. The work highlights that strong results can be achieved with frozen foundation models and a simple mapping, while also noting that automatic metrics may not fully align with human judgments in open-ended storytelling.

Abstract

Visual storytelling systems generate multi-sentence stories from image sequences. In this task, capturing contextual information and bridging visual variation bring additional challenges. We propose a simple yet effective framework that leverages the generalization capabilities of pretrained foundation models, only training a lightweight vision-language mapping network to connect modalities, while incorporating context to enhance coherence. We introduce a multimodal contrastive objective that also improves visual relevance and story informativeness. Extensive experimental results, across both automatic metrics and human evaluations, demonstrate that the stories generated by our framework are diverse, coherent, informative, and interesting.
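
The abstract does not spell out the form of the multimodal contrastive objective. As a rough illustration only, the sketch below uses a generic InfoNCE-style loss in the image-to-text direction: each image embedding should score higher against its own sentence embedding than against the other sentences in the batch. The function names, the temperature value, and the toy two-dimensional embeddings are all assumptions for illustration, not the authors' formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(image_embs, text_embs, temperature=0.07):
    """Generic InfoNCE loss (image -> text direction), averaged over the batch.

    Row i of image_embs is assumed to be paired with row i of text_embs.
    The temperature is a hypothetical default, not a value from the paper.
    """
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(image_embs[i], t) / temperature for t in text_embs]
        # Numerically stable log-sum-exp over all candidate texts.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy embeddings: each text is close to its own image in the aligned case.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]
swapped_texts = [aligned_texts[1], aligned_texts[0]]

loss_aligned = info_nce(images, aligned_texts)
loss_swapped = info_nce(images, swapped_texts)
print(loss_aligned < loss_swapped)
```

The loss is small when each generated sentence is closest to its own image and grows when sentences match the wrong images, which is the pressure toward visual relevance that the abstract describes.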

Paper Structure

This paper contains 31 sections, 9 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Illustration of the framework. A Transformer-based mapping network ($\mathcal{MN}_{\mathrm{v}}$) is trained to map visual features from a frozen encoder (CLIP) into a visual prefix for a frozen LLM (GPT2). We incorporate the previous sentences as the context via (1) concatenation after $\mathcal{MN}_{\mathrm{v}}$: previous context is encoded by the LLM (GPT2), combined with the visual prefix and then fed into the LLM decoder; or (2) concatenation before $\mathcal{MN}_{\mathrm{v}}$: previous context is encoded by the CLIP text encoder, combined with CLIP visual features and then fed into $\mathcal{MN}_{\mathrm{v}}$. In addition to the teacher-forcing objective $\mathcal{L}_{\mathrm{NLL}}$, we further compel the model to produce text that aligns semantically with the image through a contrastive training objective $\mathcal{L}_{\mathrm{contras}}$.
  • Figure 2: Impact of context length: CIDEr for varying numbers of previous context sentences, with concatenation before (top) and after (bottom) $\mathcal{MN}_{\mathrm{v}}$.
  • Figure 3: Impact of the contrastive training objective: CLIPScore (top) and SPICE (bottom) of our models trained without and with $\mathcal{L}_{\mathrm{contras}}$.
  • Figure 4: Impact of language model size: BLEU-3 and BLEU-4 (top) and ROUGE-L (bottom) of our models using GPT2-small, medium, large and xl as the text generator, with textual context concatenation after $\mathcal{MN}_{\mathrm{v}}$.
  • Figure 5: Qualitative examples of our model and baselines. Words highlighted in yellow are repetitive expressions, and words in red represent content that is not relevant to the image sequence.
  • ...and 6 more figures
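
The two context-integration strategies in the Figure 1 caption differ mainly in where the previous sentences enter the pipeline. The shape-level sketch below is a minimal illustration under stated assumptions: the toy dimensions, the mean-pool-plus-projection stand-in for the Transformer-based mapping network, the random placeholder features, and the ordering of context before prefix are all hypothetical, and no real CLIP or GPT-2 model is involved.

```python
import random

# Toy sizes chosen only for illustration; real CLIP/GPT-2 dimensions are far larger.
D_CLIP, D_LM, PREFIX_LEN = 4, 6, 2

rng = random.Random(0)

# Stand-in for the trainable mapping network weights:
# one D_CLIP x D_LM projection per prefix position.
WEIGHTS = [
    [[rng.uniform(-1, 1) for _ in range(D_LM)] for _ in range(D_CLIP)]
    for _ in range(PREFIX_LEN)
]


def random_vectors(count, dim):
    """Placeholder features standing in for frozen-encoder outputs."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(count)]


def mapping_network(clip_vectors):
    """Toy stand-in for MN_v: mean-pool CLIP-space inputs, then project
    them to PREFIX_LEN vectors in the language model's embedding space."""
    n = len(clip_vectors)
    pooled = [sum(v[i] for v in clip_vectors) / n for i in range(D_CLIP)]
    return [
        [sum(pooled[i] * w[i][j] for i in range(D_CLIP)) for j in range(D_LM)]
        for w in WEIGHTS
    ]


def concat_after(image_feat, context_lm_embeds):
    """Strategy (1): map the image alone, then join the visual prefix with
    context already embedded by the language model.
    Placing context before the prefix is an assumption of this sketch."""
    prefix = mapping_network([image_feat])
    return context_lm_embeds + prefix


def concat_before(image_feat, context_clip_feats):
    """Strategy (2): join CLIP text features of the context with the CLIP
    image feature, then map the combined sequence to a prefix."""
    return mapping_network(context_clip_feats + [image_feat])


image_feat = random_vectors(1, D_CLIP)[0]
context_clip_feats = random_vectors(2, D_CLIP)  # two previous sentences
context_lm_embeds = random_vectors(5, D_LM)     # five context token embeddings

after_input = concat_after(image_feat, context_lm_embeds)
before_input = concat_before(image_feat, context_clip_feats)
print(len(after_input), len(before_input))  # prints: 7 2
```

In this sketch, concatenation after the mapping network lengthens the language model input with the context length, while concatenation before it keeps the input at the fixed prefix length, because the context is absorbed inside the mapping network.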