ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models

Sibo Dong; Ismail Shaheen; Maggie Shen; Rupayan Mallick; Sarah Adel Bargal

TL;DR

ViSTA tackles visual storytelling by introducing a multi-modal history fusion module and a lightweight history adapter that condition a frozen diffusion model on past text–image history. A salient history selection mechanism focuses conditioning on the most informative history at each step, improving coherence without full model fine-tuning. The approach is evaluated with a TIFA-based text–image alignment metric and shows improved narrative alignment and frame-wise consistency on StorySalon and FlintStonesSV compared to state-of-the-art baselines. The work offers a practical, efficient solution for story-driven image generation and introduces targeted evaluation for alignment in visual storytelling.

Abstract

Text-to-image diffusion models have achieved remarkable success, yet generating coherent image sequences for visual storytelling remains challenging. A key challenge is effectively leveraging all previous text-image pairs, referred to as history text-image pairs, which provide contextual information for maintaining consistency across frames. Existing auto-regressive methods condition on all past text-image pairs but require extensive training, while training-free subject-specific approaches ensure consistency but lack adaptability to narrative prompts. To address these limitations, we propose ViSTA, a multi-modal history adapter for text-to-image diffusion models. It consists of (1) a multi-modal history fusion module that extracts relevant history features and (2) a history adapter that conditions generation on those features. We also introduce a salient history selection strategy at inference time, in which the most salient history text-image pair is selected to improve the quality of the conditioning. Furthermore, we propose using TIFA, a Visual Question Answering-based metric, to assess text-image alignment in visual storytelling, providing a more targeted and interpretable assessment of generated images. Evaluated on the StorySalon and FlintStonesSV datasets, ViSTA generates images that are both consistent across frames and well aligned with the narrative text descriptions.
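
To make the VQA-based evaluation concrete, the following is a minimal sketch of a TIFA-style score: the fraction of question-answer pairs, derived from a frame's text description, that a VQA model answers as expected when shown the generated image. The names `tifa_style_score`, `story_score`, and the `vqa` callable are hypothetical, the exact-match comparison is a simplification, and averaging per-frame scores over a story is an assumption rather than the evaluation protocol of the paper.

```python
from typing import Any, Callable, Sequence, Tuple

QAPair = Tuple[str, str]  # (question, expected answer), both plain strings


def tifa_style_score(
    image: Any,
    qa_pairs: Sequence[QAPair],
    vqa: Callable[[Any, str], str],
) -> float:
    """Fraction of questions the VQA callable answers as expected for one image.

    `vqa` is a hypothetical stand-in for any VQA model: it takes an image and
    a question and returns an answer string.
    """
    if not qa_pairs:
        raise ValueError("at least one question-answer pair is required")
    # Simplification: case-insensitive exact match between answers.
    correct = sum(
        vqa(image, question).strip().lower() == expected.strip().lower()
        for question, expected in qa_pairs
    )
    return correct / len(qa_pairs)


def story_score(
    frames: Sequence[Any],
    qa_per_frame: Sequence[Sequence[QAPair]],
    vqa: Callable[[Any, str], str],
) -> float:
    """Mean per-frame score over a story (an assumed aggregation)."""
    if not frames or len(frames) != len(qa_per_frame):
        raise ValueError("need one non-empty list of QA pairs per frame")
    scores = [tifa_style_score(img, qa, vqa) for img, qa in zip(frames, qa_per_frame)]
    return sum(scores) / len(scores)
```

In practice the question-answer pairs would be generated from each narrative prompt and `vqa` would wrap a real VQA model; here both are left abstract so the scoring logic can be read in isolation.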
