Causal-Story: Local Causal Attention Utilizing Parameter-Efficient Tuning For Visual Story Synthesis

Tianyi Song, Jiuxin Cao, Kun Wang, Bo Liu, Xiaofeng Zhang

TL;DR

The proposed Causal-Story model incorporates a local causal attention mechanism that considers the causal relationships among previous captions, previous frames, and the current captions when generating the current frame, thereby improving the global consistency of story generation.

Abstract

The excellent text-to-image synthesis capability of diffusion models has driven progress in synthesizing coherent visual stories. The current state-of-the-art method combines the features of historical captions, historical frames, and the current captions as conditions for generating the current frame. However, this method treats every historical frame and caption as contributing equally: it connects them in order with equal weights, ignoring the fact that not all historical conditions are relevant to generating the current frame. To address this issue, we propose Causal-Story, a model that incorporates a local causal attention mechanism that considers the causal relationships among previous captions, previous frames, and the current captions. By weighting the conditions according to these relationships when generating the current frame, Causal-Story improves the global consistency of story generation. We evaluated our model on the PororoSV and FlintstonesSV datasets and obtained state-of-the-art FID scores; the generated frames also demonstrate better visual storytelling.
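
To make the idea of weighting historical conditions more concrete, the following is a minimal sketch of masked attention over a sequence of story steps. It is not the paper's implementation: the abstract does not specify how the attention is computed, so the function names `local_causal_mask` and `masked_attention`, the `window` parameter, and the use of one feature vector per story step are all assumptions made purely for illustration. The sketch only shows the general technique of letting each step attend to itself and a limited number of earlier steps, with learned-similarity weights replacing equal weights.

```python
import math


def local_causal_mask(num_steps, window):
    """Hypothetical mask: mask[i][j] is True when step i may attend to step j.

    A step may attend to itself and to at most (window - 1) earlier steps,
    never to later ones.
    """
    return [[0 <= i - j < window for j in range(num_steps)]
            for i in range(num_steps)]


def masked_attention(queries, keys, values, mask):
    """Scaled dot-product attention restricted to the positions allowed by mask.

    queries, keys, values: lists of equal-length feature vectors, one per step.
    Returns (outputs, weights); each row of weights sums to 1 and is zero
    wherever the mask is False.
    """
    dim = len(queries[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        allowed = [j for j in range(len(keys)) if mask[i][j]]
        scores = [sum(a * b for a, b in zip(query, keys[j])) / math.sqrt(dim)
                  for j in allowed]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [0.0] * len(keys)
        for j, e in zip(allowed, exps):
            weights[j] = e / total
        output = [sum(weights[j] * values[j][k] for j in range(len(values)))
                  for k in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Toy features standing in for per-step caption/frame embeddings (made up).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]
mask = local_causal_mask(num_steps=4, window=2)
outputs, weights = masked_attention(features, features, features, mask)
```

In this toy setup, a step with no history attends only to itself, while later steps distribute their attention unevenly over the allowed positions according to feature similarity, in contrast to an equal-weight combination of all historical conditions.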

Paper Structure (10 sections, 7 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: An example of a story in PororoSV with five frames and captions. The green number indicates a dependency relationship between the previous frame and the current frame to be generated, while the red number indicates that it is not related to the generation of the current frame.
  • Figure 2: Model architecture of Causal-Story. Our model is inspired by AR-LDM. The solid-line box represents the overall structure of the denoising U-Net of the Stable Diffusion model, while the dashed-line boxes show the specific composition of key modules. The green dashed box shows the location of the local causal attention module and the adapter, while the gray dashed box shows the details of the local causal attention module.
  • Figure 3: Examples of images generated by the previous models StoryDALL-E and AR-LDM, and by our model.