ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models

Sibo Dong, Ismail Shaheen, Maggie Shen, Rupayan Mallick, Sarah Adel Bargal

TL;DR

ViSTA tackles visual storytelling by introducing a multi-modal history fusion module and a lightweight history adapter that condition a frozen diffusion model on past text–image history. A salient history selection mechanism focuses conditioning on the most informative history at each step, improving coherence without full model fine-tuning. The approach is evaluated with a TIFA-based text–image alignment metric and shows improved narrative alignment and frame-wise consistency on StorySalon and FlintStonesSV compared to state-of-the-art baselines. The work offers a practical, efficient solution for story-driven image generation and introduces targeted evaluation for alignment in visual storytelling.

Abstract

Text-to-image diffusion models have achieved remarkable success, yet generating coherent image sequences for visual storytelling remains challenging. A key challenge is effectively leveraging all previous text-image pairs, referred to as history text-image pairs, which provide contextual information for maintaining consistency across frames. Existing auto-regressive methods condition on all past image-text pairs but require extensive training, while training-free subject-specific approaches ensure consistency but lack adaptability to narrative prompts. To address these limitations, we propose ViSTA, a multi-modal history adapter for text-to-image diffusion models. It consists of (1) a multi-modal history fusion module to extract relevant history features and (2) a history adapter to condition the generation on the extracted relevant features. We also introduce a salient history selection strategy during inference, where the most salient history text-image pair is selected, improving the quality of the conditioning. Furthermore, we propose to employ the Visual Question Answering-based metric TIFA to assess text-image alignment in visual storytelling, providing a more targeted and interpretable assessment of generated images. Evaluated on the StorySalon and FlintStonesSV datasets, our proposed ViSTA model is not only consistent across different frames, but also well-aligned with the narrative text descriptions.
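
As an illustration of the VQA-based evaluation mentioned above, the following is a minimal sketch of a TIFA-style score: the fraction of question-answer pairs for which a VQA model's answer on the generated image matches the expected answer. The function name `tifa_score`, the `vqa_answer` callable, and the case-insensitive exact-match comparison are assumptions made for this sketch; they are not the authors' evaluation code or the TIFA library's API.

```python
from typing import Any, Callable, Sequence, Tuple


def tifa_score(
    image: Any,
    qa_pairs: Sequence[Tuple[str, str]],
    vqa_answer: Callable[[Any, str], str],
) -> float:
    """Return the fraction of questions answered as expected for one image.

    `vqa_answer` is a hypothetical stand-in for a real VQA model: it takes
    the image and a question and returns an answer string.
    """
    if not qa_pairs:
        raise ValueError("need at least one question-answer pair")
    correct = 0
    for question, expected in qa_pairs:
        predicted = vqa_answer(image, question)
        # Assumption: answers are compared by case-insensitive exact match.
        if predicted.strip().lower() == expected.strip().lower():
            correct += 1
    return correct / len(qa_pairs)
```

In practice the question-answer pairs would be generated from the caption by a language model and `vqa_answer` would wrap an actual VQA model; here both are left to the caller so the averaging step stays visible.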

Paper Structure

This paper contains 22 sections, 10 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Overview of our proposed ViSTA model. (a) To generate current image $I_{k+1}$, ViSTA takes the current text prompt $P_{k+1}$, history prompt $P_k$, and history image $I_k$ as input. We propose a multi-modal history fusion model to extract and integrate the historical information and output a fusion feature $c^F$. We utilize a lightweight history adapter to condition on the fusion feature, avoiding modifying the diffusion models' architecture or full fine-tuning of the models; only the fusion model and adapter need to be trained. (b) Our proposed multi-modal history fusion model consists of $d$ blocks of history cross-attention and feedforward layers. In the history cross-attention, we use the current prompt embedding $c^P_{k+1}$ as the query, and the concatenation of history prompt $c^P_{k}$ and image embedding $c^I_{k}$ as the key and value. (c) Salient history selection strategy during inference. Instead of using all history prompts $P_{1, ..., k}$ and images $I_{1, ..., k}$ as references for the current generation, we select the most salient history based on the attention score $s$. In this way, we make the model focus on the most informative history.
  • Figure 2: TIFA evaluation example for the StorySalon test dataset. Given the caption, the following question-answer pairs are generated by a language model. Both binary Yes/No questions and multiple-choice questions can be generated. Then a VQA model UnifiedQA evaluates the generated image based on the question-answer pairs. The final TIFA score for the generated image is the average over all questions.
  • Figure 3: Qualitative results: StorySalon. This figure presents a comparison of storytelling between ViSTA, baseline methods, and state-of-the-art methods on a sample StorySalon story. We include ground truth images from the dataset as a reference for evaluating visual coherence and accuracy. The same reference image is used across IP-Adapter, StoryGen, and ViSTA. The first story prompt corresponds to the reference image, while story prompts 2 to 7 are used to generate the subsequent frames. While SDXL-Prompt produces high-quality, well-aligned images, it fails to generate a consistent character across all frames. Although IP-Adapter keeps the character consistent, its generated images do not align with the prompt. Compared with the state-of-the-art StoryGen, our ViSTA shows better consistency in both characters and style.
  • Figure 4: ViSTA sample on FlintStonesSV test set.
  • Figure 5: ViSTA sample on FlintStonesSV test set.
  • ...and 4 more figures
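
The history cross-attention and salient history selection described in the Figure 1 caption can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: learned query/key/value projections, multiple heads, and the feedforward layers are omitted, and the saliency score $s$ is assumed here to be the average attention mass that the current prompt tokens place on each history pair when attending over all history tokens jointly. The function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention weights, one row per query token."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def history_cross_attention(cur_prompt, hist_prompt, hist_image):
    """Query: current prompt tokens. Key and value: history prompt tokens
    concatenated with history image tokens (identity projections assumed)."""
    kv = hist_prompt + hist_image  # concatenate along the token axis
    weights = attention_weights(cur_prompt, kv)
    dim = len(kv[0])
    return [
        [sum(w * v[d] for w, v in zip(row, kv)) for d in range(dim)]
        for row in weights
    ]


def select_salient_history(cur_prompt, history):
    """Return (index, scores) for the history pair with the highest score.

    ASSUMPTION: the score of a pair is the attention mass it receives,
    averaged over the current prompt tokens.
    """
    keys, owner = [], []
    for idx, (prompt_emb, image_emb) in enumerate(history):
        tokens = prompt_emb + image_emb
        keys.extend(tokens)
        owner.extend([idx] * len(tokens))
    weights = attention_weights(cur_prompt, keys)
    scores = [0.0] * len(history)
    for row in weights:
        for w, idx in zip(row, owner):
            scores[idx] += w / len(weights)
    best = max(range(len(history)), key=lambda i: scores[i])
    return best, scores
```

A typical use would call `select_salient_history` on the embeddings of all previous frames, then pass only the selected pair to `history_cross_attention` to obtain the fused feature that conditions the adapter. Note that this scoring favours pairs with more tokens, so it assumes every history pair contributes the same number of tokens.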