Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning
Yingjin Song, Denis Paperno, Albert Gatt
TL;DR
The paper tackles visual storytelling from image sequences (VIST), aiming to produce coherent, informative narratives despite visual variation across images. It presents a lightweight framework that keeps CLIP and GPT-2 frozen and trains only a context-aware mapping network that turns visual features into prefixes for the language model. Training is further supported by curriculum learning and a multimodal contrastive objective intended to improve grounding and informativeness. The contributions are two strategies for integrating context into the prefixes, a curriculum-learning regime, and contrastive training, evaluated with both automatic metrics and human judgments. The results indicate competitive performance with favorable grounding and informativeness, showing that frozen foundation models plus a simple mapping can go a long way. The authors also note that automatic metrics may not fully align with human judgments in open-ended storytelling.
Abstract
Visual storytelling systems generate multi-sentence stories from image sequences. In this task, capturing contextual information and bridging visual variation bring additional challenges. We propose a simple yet effective framework that leverages the generalization capabilities of pretrained foundation models, training only a lightweight vision-language mapping network to connect modalities while incorporating context to enhance coherence. We introduce a multimodal contrastive objective that also improves visual relevance and story informativeness. Extensive experimental results, across both automatic metrics and human evaluations, demonstrate that the stories generated by our framework are diverse, coherent, informative, and interesting.
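
To make the idea of a multimodal contrastive objective concrete, the sketch below computes a generic symmetric InfoNCE loss over a batch of paired image and text feature vectors. This is an illustration only: the abstract does not specify the exact loss, so the CLIP-style symmetric form, the function name `symmetric_info_nce`, and the temperature of 0.07 are all assumptions rather than the authors' formulation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def log_softmax_row(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(image_feats, text_feats, temperature=0.07):
    """Generic symmetric InfoNCE loss (hypothetical helper, not the paper's code).

    Row k of image_feats is assumed to be paired with row k of text_feats;
    every other row in the batch serves as a negative example.
    """
    assert image_feats and len(image_feats) == len(text_feats)
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    n = len(imgs)

    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: each image should pick out its own text.
    i2t = -sum(log_softmax_row(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: same computation over the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax_row(columns[k])[k] for k in range(n)) / n

    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    # Correctly paired features should yield a lower loss than mismatched ones.
    print(symmetric_info_nce(images, aligned_texts))
    print(symmetric_info_nce(images, swapped_texts))
```

In a setup like the one described, the image features would come from the frozen vision encoder and the text features from representations of the generated or reference story, with gradients flowing only into the trainable mapping network; those wiring details are likewise assumptions here.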
