Customized Visual Storytelling with Unified Multimodal LLMs

Wei-Hua Li, Cheng Sun, Chu-Song Chen

Abstract

Multimodal story customization aims to generate coherent story flows conditioned on textual descriptions, reference identity images, and shot types. While recent progress in story generation has shown promising results, most approaches rely on text-only inputs. A few studies incorporate character identity cues (e.g., facial ID) but lack broader multimodal conditioning. In this work, we introduce VstoryGen, a multimodal framework that integrates textual descriptions with character and background references to enable customizable story generation. To enhance cinematic diversity, we further add shot-type control via parameter-efficient prompt tuning on movie data, enabling the model to generate sequences that more faithfully reflect cinematic grammar. To evaluate our framework, we establish two new benchmarks that assess multimodal story customization in terms of character and scene consistency, text-visual alignment, and shot-type control. Experiments demonstrate that VstoryGen achieves improved consistency and cinematic diversity compared to existing methods.
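
The abstract only names the mechanism, so here is a minimal sketch of what parameter-efficient prompt tuning for shot-type control could look like: a small bank of learnable soft-prompt embeddings, one block per shot type, is prepended to the frozen backbone's input embeddings and is the only part trained on the movie data. The class name `ShotTypePrompts`, the shot-type count, and the dimensions below are placeholder assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ShotTypePrompts(nn.Module):
    """Hypothetical sketch of soft prompt tuning for shot-type control;
    the prompt bank is the only trainable component."""

    def __init__(self, num_shot_types=5, prompt_len=8, hidden_dim=4096):
        super().__init__()
        # One learnable prompt block per shot type (e.g., close-up,
        # medium shot, long shot); the backbone itself stays frozen.
        self.prompts = nn.Parameter(
            0.02 * torch.randn(num_shot_types, prompt_len, hidden_dim)
        )

    def forward(self, token_embeds, shot_ids):
        # token_embeds: (batch, seq_len, hidden_dim) from the frozen model
        # shot_ids:     (batch,) index of the requested shot type
        prompt = self.prompts[shot_ids]        # (batch, prompt_len, hidden_dim)
        return torch.cat([prompt, token_embeds], dim=1)

# Usage sketch: prepend the prompts, feed the result to the backbone via
# its `inputs_embeds` path, and optimize only `bank.prompts`.
bank = ShotTypePrompts()
embeds = torch.randn(2, 16, 4096)              # stand-in token embeddings
out = bank(embeds, torch.tensor([0, 3]))       # (2, 8 + 16, 4096)
```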

Paper Structure

This paper contains 29 sections, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: Shot-Type Controlled Visual Story Customization
  • Figure 2: Overview of VstoryGen (restated as a control-flow sketch after this figure list). (1) Multimodal scripts are generated from an overall text description $D$, where the reference images of characters (indexed $1$ to $m$) and background scenes (indexed $1$ to $o$) are generated first, and the multimodal scripts (indexed $1$ to $n$) are generated afterward; (2) CustFilmer produces consistent keyframes corresponding to scripts $1, \cdots, n$, respectively; (3) TI2V expands the $n$ keyframes into a video.
  • Figure 3: Illustration of CustFilmer, which can take multimodal materials as input to generate sequences of consistent keyframes.
  • Figure 4: Qualitative Comparison between (a) IP-Adapter [ye2023ip], (b) StoryDiffusion [zhou2024storydiffusion], (c) ConsiStory [tewel2024training], (d) DreamStory [he2025dreamstory], (e) CharaConsist [wang2025characonsist], and (f) CustFilmer (Ours) on MSB
  • Figure 5: Qualitative Comparison between (a) IP-Adapter [ye2023ip], (b) StoryDiffusion [zhou2024storydiffusion], (c) ConsiStory [tewel2024training], (d) DreamStory [he2025dreamstory], (e) CharaConsist [wang2025characonsist], and (f) CustFilmer (Ours) on M$^2$SB
  • ...and 3 more figures
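
Figure 2 compresses the whole pipeline into a single caption, so the sketch below restates it as Python control flow purely as a reading aid. Every function here is a hypothetical stub named after the caption's three stages; none of this is the paper's actual interface.

```python
# Hypothetical stubs for the three stages in Figure 2 (placeholders, not
# the authors' API); each would wrap a call to the underlying model.
def gen_character_ref(description, i): ...           # reference image, character i
def gen_scene_ref(description, j): ...               # reference image, scene j
def gen_scripts(description, chars, scenes, n): ...  # n multimodal scripts
def cust_filmer(script, chars, scenes): ...          # one consistent keyframe
def ti2v(scripts, keyframes): ...                    # keyframes -> video

def vstorygen(description, m, o, n):
    # (1) From the overall description D, generate reference images for the
    #     m characters and o background scenes first, then the n scripts.
    chars = [gen_character_ref(description, i) for i in range(1, m + 1)]
    scenes = [gen_scene_ref(description, j) for j in range(1, o + 1)]
    scripts = gen_scripts(description, chars, scenes, n)

    # (2) CustFilmer renders one keyframe per script, keeping characters
    #     and scenes consistent across all n keyframes.
    keyframes = [cust_filmer(s, chars, scenes) for s in scripts]

    # (3) A text-and-image-to-video (TI2V) model expands the n keyframes
    #     into the final story video.
    return ti2v(scripts, keyframes)
```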