
Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models

Fei Shen, Hu Ye, Sibo Liu, Jun Zhang, Cong Wang, Xiao Han, Wei Yang

TL;DR

This work introduces Rich-contextual Conditional Diffusion Models (RCDMs) to address inconsistencies in story visualization by jointly leveraging rich contextual cues. It comprises a frame-prior transformer diffusion model that predicts frame semantics for an unknown clip using known frames and captions, followed by a frame-contextual 3D diffusion model that fuses image-level references, the predicted frame embedding, and all captions at both image and feature levels to generate coherent multi-frame stories in a single forward pass. Quantitative and qualitative evaluations on FlintstonesSV and PororoSV demonstrate superior FID, character accuracy, and character-F1 scores compared to state-of-the-art baselines, with ablations highlighting the importance of the two-stage design and rich conditioning. The approach also enables branching narratives, faster inference than autoregressive methods, and caption-only generation, offering practical benefits for scalable, consistent story visualization, while acknowledging open-set generalization as future work.

Abstract

Recent research showcases the considerable potential of conditional diffusion models for generating consistent stories. However, current methods, which predominantly generate stories in an autoregressive and excessively caption-dependent manner, often underrate the contextual consistency and relevance of frames during sequential generation. To address this, we propose Rich-contextual Conditional Diffusion Models (RCDMs), a novel two-stage approach designed to enhance the semantic and temporal consistency of story generation. Specifically, in the first stage, a frame-prior transformer diffusion model predicts the frame semantic embedding of the unknown clip by aligning the semantic correlations between the captions and frames of the known clip. The second stage establishes a robust model with rich contextual conditions, including reference images of the known clip, the predicted frame semantic embedding of the unknown clip, and text embeddings of all captions. By jointly injecting these rich contextual conditions at the image and feature levels, RCDMs can generate semantically and temporally consistent stories. Moreover, unlike autoregressive models, RCDMs can generate consistent stories with a single forward inference. Our qualitative and quantitative results demonstrate that RCDMs outperform existing methods in challenging scenarios. The code and model will be available at https://github.com/muzishen/RCDMs.
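
The two-stage data flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`toy_embed`, `stage1_frame_prior`, `stage2_frame_contextual`, `generate_story`, `EMBED_DIM`) is hypothetical, the encoder is a deterministic placeholder, and both "models" are toy stand-ins that only show which inputs each stage consumes and that the second stage is called once for the whole unknown clip.

```python
import random
from typing import List

Vector = List[float]
EMBED_DIM = 8  # hypothetical embedding size, chosen only for this sketch


def toy_embed(text: str) -> Vector:
    """Placeholder encoder (assumption): a deterministic pseudo-random vector per string."""
    rng = random.Random(text)
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def stage1_frame_prior(known_frames: List[Vector],
                       known_captions: List[Vector],
                       unknown_captions: List[Vector]) -> List[Vector]:
    """Toy stand-in for the frame-prior model (not a diffusion model).

    It estimates the mean caption-to-frame offset on the known clip and
    applies it to each unknown caption, yielding one predicted frame
    embedding per unknown caption.
    """
    n = len(known_frames)
    offset = [
        sum(f[d] - c[d] for f, c in zip(known_frames, known_captions)) / n
        for d in range(EMBED_DIM)
    ]
    return [[c[d] + offset[d] for d in range(EMBED_DIM)] for c in unknown_captions]


def stage2_frame_contextual(reference_images: List[str],
                            predicted_frames: List[Vector],
                            all_captions: List[Vector]) -> List[str]:
    """Toy stand-in for the frame-contextual model.

    It receives all three condition types at once and returns every
    unknown frame from a single call, instead of one call per frame.
    """
    if len(all_captions) != len(reference_images) + len(predicted_frames):
        raise ValueError("expected one caption per known and unknown frame")
    return [f"generated_frame_{i}" for i in range(len(predicted_frames))]


def generate_story(known_images: List[str],
                   known_captions: List[str],
                   unknown_captions: List[str]) -> List[str]:
    """Run stage 1, then stage 2, mirroring the order given in the abstract."""
    known_frame_embs = [toy_embed("image:" + img) for img in known_images]
    known_cap_embs = [toy_embed(c) for c in known_captions]
    unknown_cap_embs = [toy_embed(c) for c in unknown_captions]
    predicted = stage1_frame_prior(known_frame_embs, known_cap_embs, unknown_cap_embs)
    return stage2_frame_contextual(known_images, predicted,
                                   known_cap_embs + unknown_cap_embs)
```

Calling `generate_story` with two known frames and three unknown captions returns three placeholder frame identifiers from one second-stage call, in contrast to an autoregressive loop that would invoke the generator once per frame.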

Paper Structure (17 sections, 6 equations, 16 figures, 3 tables)

Figures (16)

  • Figure 1: (a) Existing methods, which employ autoregressive models and rely on the current caption for guidance, suffer from weak conditioning, leading to a decrease in the consistency of the generated story. (b) RCDMs initially predict the frame-contextual information at the feature level, then simultaneously infuse image-level and feature-level contextual information to generate coherent stories in a single forward inference.
  • Figure 2: Illustration of the frame-prior transformer diffusion model. The frame-prior transformer diffusion model predicts the frame semantic embeddings of unknown clips by aligning the semantic correlations between the captions and frames of known clips.
  • Figure 3: Overview of the frame-contextual 3D diffusion model. The frame-contextual 3D diffusion model infuses both image-level and feature-level context information to generate stories with stylistic and temporal consistency.
  • Figure 4: Qualitative comparisons with several state-of-the-art models on the FlintstonesSV and PororoSV datasets. Please see Appendix C for more examples.
  • Figure 5: Results of user study. Higher values indicate better performance.
  • ...and 11 more figures