FashionFlow: Leveraging Diffusion Models for Dynamic Fashion Video Synthesis from Static Imagery

Tasin Islam, Alina Miron, XiaoHui Liu, Yongmin Li

TL;DR

FashionFlow presents a diffusion-based pipeline that generates short fashion videos from a single image by operating on a latent video space $V$ conditioned both locally via a VAE-encoded first frame $I_{vae}$ and globally via cross-attention with $I_{vae}$ and $I_{clip}$. The method employs pseudo-3D convolution, frame interpolation, and multi-level attention to produce temporally coherent, high-resolution videos without person-specific fine-tuning, achieving strong quantitative and qualitative results against GAN-based and prior diffusion approaches. An extensive ablation study demonstrates the advantage of combining global and local conditioning for preserving garment details and colors. The work highlights significant practical impact for online fashion shopping by enabling rapid, high-quality video synthesis that enhances product visualization and user experience. Overall, FashionFlow advances diffusion-based video generation in fashion by delivering fast, detail-preserving conditioned videos suitable for marketing and e-commerce contexts.
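
The local conditioning step described above can be illustrated with a small shape-level sketch. It assumes that "adding the VAE-encoded image as the first frame" means concatenating $I_{vae}$ in front of the noisy latent $V$ along the frame axis; the function name `add_local_condition`, the tensor sizes, and the `(b, c, f, h, w)` layout are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

def add_local_condition(noisy_latent, image_latent):
    """Prepend the encoded image as frame 0 of the latent video.

    noisy_latent: (b, c, f, h, w) noisy latent video V (assumed layout)
    image_latent: (b, c, h, w) VAE-encoded conditioning image I_vae
    """
    first_frame = image_latent[:, :, None, :, :]  # add a frame axis of length 1
    return np.concatenate([first_frame, noisy_latent], axis=2)

rng = np.random.default_rng(0)
V = rng.standard_normal((1, 4, 7, 8, 8))   # hypothetical sizes: 7 noisy frames
I_vae = rng.standard_normal((1, 4, 8, 8))  # hypothetical encoded image
conditioned = add_local_condition(V, I_vae)
print(conditioned.shape)  # (1, 4, 8, 8, 8)
```

Frame 0 of the result is the clean image latent, so the denoiser sees an uncorrupted reference at every step while the remaining frames carry the noise.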

Abstract

Our study introduces a new image-to-video generator called FashionFlow to generate fashion videos. By utilising a diffusion model, we are able to create short videos from still fashion images. Our approach involves developing and connecting relevant components with the diffusion model, which results in the creation of high-fidelity videos that are aligned with the conditional image. The components include the use of pseudo-3D convolutional layers to generate videos efficiently. VAE and CLIP encoders capture vital characteristics from still images to condition the diffusion model at a global level. Our research demonstrates a successful synthesis of fashion videos featuring models posing from various angles, showcasing the fit and appearance of the garment. Our findings hold great promise for improving and enhancing the shopping experience for the online fashion industry.
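
As a rough illustration of global conditioning, the sketch below implements single-head cross-attention in which flattened U-Net features act as queries and the image embeddings act as keys and values. How the VAE and CLIP embeddings are combined is not stated here, so the sketch assumes they have been projected to a common width and stacked along the token axis; all names, sizes, and weight matrices are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(features, context, w_q, w_k, w_v):
    """features: (n, d_feat) queries; context: (m, d_ctx) conditioning tokens."""
    q = features @ w_q                      # (n, d)
    k = context @ w_k                       # (m, d)
    v = context @ w_v                       # (m, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n, m), each row sums to 1
    return attn @ v, attn

rng = np.random.default_rng(0)
features = rng.standard_normal((16, 32))    # hypothetical flattened U-Net positions
vae_tokens = rng.standard_normal((4, 24))   # assumed: projected VAE tokens
clip_token = rng.standard_normal((1, 24))   # assumed: projected CLIP embedding
context = np.concatenate([vae_tokens, clip_token], axis=0)
w_q = rng.standard_normal((32, 32))
w_k = rng.standard_normal((24, 32))
w_v = rng.standard_normal((24, 32))
attended, attn = cross_attention(features, context, w_q, w_k, w_v)
print(attended.shape, attn.shape)  # (16, 32) (16, 5)
```

Every feature position mixes information from all conditioning tokens, which is why this path influences the whole frame rather than a single region.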

Paper Structure (20 sections, 5 equations, 6 figures, 2 tables)

This paper contains 20 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: The architecture of our proposed image-to-video model. Our approach involves a latent diffusion model (Rombach et al., 2022) to denoise the latent space of a video. Each frame of the latent space is then processed by a pre-trained VAE decoder to generate the final video. We condition the video in two ways: locally and globally. Local conditioning involves adding a VAE-encoded image as the first frame of the noisy latent, while global conditioning involves using cross-attention layers to influence intermediate features with the conditioning image throughout the layers of the U-Net.
  • Figure 2: The architecture of the pseudo-3D convolutional and attention layers. b, c, f, h and w denote the batch size, number of channels, number of frames, height and width, respectively. (a) The pseudo-3D convolutional layer eases optimisation and performs better than its standard counterpart. (b) The spatiotemporal attention layer helps the model generate high-quality video frames while maintaining smoothness and consistency. (c) The cross-attention layer allows the model to condition the synthesised video based on the input image.
  • Figure 3: Qualitative comparison of our method against ImaGINator (Wang et al., 2020), cINNs (Dorkenwald et al., 2021), Poke (Blattmann et al., 2021) and DreamPose (Karras et al., 2023). Our method performs a wider range of movements and is comparable to DreamPose in terms of quality and temporal consistency.
  • Figure 4: Qualitative comparison of our method against ImaGINator (Wang et al., 2020), cINNs (Dorkenwald et al., 2021), Poke (Blattmann et al., 2021) and DreamPose (Karras et al., 2023). Our method performs a wider range of movements and is comparable to DreamPose in terms of quality and temporal consistency.
  • Figure 5: The effects of image conditioning. Global conditioning alone captures the overall colour of the garment but misses smaller details such as the white stripes. Local conditioning alone darkens the skin colour too much and also fails to capture small clothing details. With both local and global conditioning, the model captures the overall colour from the conditioning image and picks up small details such as the stripe.
  • ...and 1 more figures
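
The pseudo-3D convolution in Figure 2(a) can be sketched as a 2D spatial convolution applied to every frame, followed by a 1D temporal convolution applied at every pixel. The NumPy version below is a minimal illustration of that factorisation using the `(b, c, f, h, w)` layout from the caption; the helper names, kernel sizes, zero padding, and the identity temporal kernel used in the demo are assumptions made for the example, not the authors' implementation.

```python
import numpy as np

def conv2d_same(x, weight):
    """x: (n, c_in, h, w); weight: (c_out, c_in, k, k); zero padding, stride 1."""
    k = weight.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)))
    n, _, h, w = x.shape
    out = np.zeros((n, weight.shape[0], h, w))
    for i in range(k):
        for j in range(k):
            out += np.einsum('nchw,oc->nohw', xp[:, :, i:i + h, j:j + w], weight[:, :, i, j])
    return out

def conv1d_same(x, weight):
    """x: (n, c_in, f); weight: (c_out, c_in, k); zero padding, stride 1."""
    k = weight.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (0, 0), (p, p)))
    f = x.shape[-1]
    out = np.zeros((x.shape[0], weight.shape[0], f))
    for i in range(k):
        out += np.einsum('ncf,oc->nof', xp[:, :, i:i + f], weight[:, :, i])
    return out

def pseudo3d_conv(x, w_spatial, w_temporal):
    b, c, f, h, w = x.shape
    # Fold frames into the batch axis so the 2D conv sees one image per frame.
    y = x.transpose(0, 2, 1, 3, 4).reshape(b * f, c, h, w)
    y = conv2d_same(y, w_spatial)
    c2 = y.shape[1]
    # Fold pixels into the batch axis so the 1D conv runs along the frame axis.
    y = y.reshape(b, f, c2, h, w).transpose(0, 3, 4, 2, 1).reshape(b * h * w, c2, f)
    y = conv1d_same(y, w_temporal)
    c3 = y.shape[1]
    return y.reshape(b, h, w, c3, f).transpose(0, 3, 4, 1, 2)  # back to (b, c, f, h, w)

rng = np.random.default_rng(0)
video = rng.standard_normal((2, 3, 4, 5, 5))
w_sp = rng.standard_normal((6, 3, 3, 3))
w_tm = np.zeros((6, 6, 3))
w_tm[:, :, 1] = np.eye(6)  # identity temporal kernel, chosen only for the demo
out = pseudo3d_conv(video, w_sp, w_tm)
print(out.shape)  # (2, 6, 4, 5, 5)
```

With the identity temporal kernel the layer reduces to a per-frame 2D convolution, which makes the factorisation easy to check. For the demo sizes, the two kernels hold 6·3·3·3 + 6·6·3 = 270 weights, whereas a full 3×3×3 kernel mapping 3 to 6 channels would hold 6·3·27 = 486.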