FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis

Feng Liang, Bichen Wu, Jialiang Wang, Licheng Yu, Kunpeng Li, Yinan Zhao, Ishan Misra, Jia-Bin Huang, Peizhao Zhang, Peter Vajda, Diana Marculescu

Abstract

Diffusion models have transformed image-to-image (I2I) synthesis and are now permeating into video. However, the advancement of video-to-video (V2V) synthesis has been hampered by the challenge of maintaining temporal consistency across video frames. This paper proposes a consistent V2V synthesis framework that jointly leverages spatial conditions and temporal optical flow clues within the source video. Contrary to prior methods that strictly adhere to optical flow, our approach harnesses its benefits while handling the imperfection in flow estimation. We encode the optical flow via warping from the first frame and use it as a supplementary reference in the diffusion model. This lets our model synthesize video by editing the first frame with any prevalent I2I model and then propagating the edits to successive frames. Our V2V model, FlowVid, demonstrates remarkable properties: (1) Flexibility: FlowVid works seamlessly with existing I2I models, facilitating various modifications, including stylization, object swaps, and local edits. (2) Efficiency: Generating a 4-second video at 30 FPS and 512x512 resolution takes only 1.5 minutes, which is 3.1x, 7.2x, and 10.5x faster than CoDeF, Rerender, and TokenFlow, respectively. (3) High quality: In user studies, FlowVid is preferred 45.7% of the time, outperforming CoDeF (3.5%), Rerender (10.2%), and TokenFlow (40.4%).
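
The efficiency figures in the abstract can be unpacked with a little arithmetic. The sketch below uses only the numbers quoted above (4 seconds, 30 FPS, 1.5 minutes, and the three speedup factors). The baseline runtimes it prints are merely what those rounded ratios imply, not measurements reported here, and the variable and function names are illustrative.

```python
# Figures quoted in the abstract.
DURATION_S = 4
FPS = 30
FLOWVID_MINUTES = 1.5
SPEEDUPS = {"CoDeF": 3.1, "Rerender": 7.2, "TokenFlow": 10.5}


def implied_runtimes(flowvid_minutes, speedups):
    """Return the runtime in minutes that each speedup factor implies."""
    return {name: flowvid_minutes * factor for name, factor in speedups.items()}


num_frames = DURATION_S * FPS                         # frames in the clip
seconds_per_frame = FLOWVID_MINUTES * 60 / num_frames  # average cost per frame
baseline_minutes = implied_runtimes(FLOWVID_MINUTES, SPEEDUPS)

if __name__ == "__main__":
    print(f"{num_frames} frames, {seconds_per_frame:.2f} s per frame")
    for name, minutes in baseline_minutes.items():
        print(f"{name}: about {minutes:.1f} min (implied)")
```

Under these numbers the clip has 120 frames, so the stated 1.5 minutes corresponds to an average of 0.75 seconds per generated frame.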

Paper Structure (31 sections, 5 equations, 10 figures, 2 tables)

Figures (10)

  • Figure 1: We present FlowVid to synthesize a consistent video given an input video and a target prompt. Our model supports multiple applications: (1) global stylization, such as converting the video to 2D anime; (2) object swap, such as turning the panda into a koala bear; and (3) local edit, such as adding a pig nose to a panda.
  • Figure 2: (a) Input video: 'a man is running on beach'. (b) We edit the 1st frame with 'a man is running on Mars', then conduct flow warping from the 1st frame to the 10th and 20th frames (using input video flow). Flow estimation of legs is inaccurate. (c) Our FlowVid uses spatial controls to rectify the inaccurate flow. (d) Our consistent video synthesis results.
  • Figure 3: Overview of our FlowVid. (a) Training: we first get the spatial conditions (predicted depth maps) and estimated optical flow from the input video. For all frames, we use flow to perform warping from the first frame. The resulting flow-warped video is expected to have a similar structure to the input video but with some occluded regions (marked in gray; best viewed zoomed in). We train a video diffusion model with spatial conditions $c$ and flow information $f$. (b) Generation: we edit the first frame with existing I2I models and use the flow in the input video to get the flow-warped edited video. The flow condition and the spatial condition jointly guide the output video synthesis.
  • Figure 4: Effect of color calibration in autoregressive evaluation. (a) When the autoregressive evaluation goes from the 1st batch to the 13th batch, the results without color calibration become gray. (b) The results are more stable with the proposed color calibration.
  • Figure 5: Qualitative comparison with representative V2V models. Our method stands out in terms of prompt alignment and overall video quality. We highly encourage readers to refer to video comparisons in our supplementary videos.
  • ...and 5 more figures
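
The captions for Figures 2 and 3 describe warping the first frame to later frames using optical flow, with unfilled regions shown in gray. As a rough illustration of that idea, here is a minimal pure-Python sketch of backward warping with nearest-neighbor sampling. It is not the authors' implementation. The function name, the gray fill value of 128, and the convention that the flow stores per-pixel `(dx, dy)` offsets pointing back into the first frame are all assumptions made for the example. It only marks pixels whose source falls outside the image, since the captions do not say how occlusions are detected.

```python
GRAY = 128  # assumed placeholder value for pixels with no valid source


def warp_from_first_frame(first_frame, backward_flow, fill=GRAY):
    """Warp `first_frame` to a later frame by backward sampling.

    first_frame: H x W grid (list of rows) of pixel values.
    backward_flow: H x W grid of (dx, dy) offsets; the target pixel (x, y)
        is sampled from (x + dx, y + dy) in the first frame.
    Returns (warped, valid); valid[y][x] is False where `fill` was used.
    """
    height, width = len(first_frame), len(first_frame[0])
    warped = [[fill] * width for _ in range(height)]
    valid = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = backward_flow[y][x]
            # Nearest-neighbor lookup; a real pipeline would interpolate.
            src_x = int(round(x + dx))
            src_y = int(round(y + dy))
            if 0 <= src_x < width and 0 <= src_y < height:
                warped[y][x] = first_frame[src_y][src_x]
                valid[y][x] = True
    return warped, valid


# Toy example: every target pixel samples one pixel to its left,
# so the content moves right by one column.
frame = [[10, 20, 30],
         [40, 50, 60]]
flow = [[(-1.0, 0.0)] * 3 for _ in range(2)]
warped, valid = warp_from_first_frame(frame, flow)
```

In the toy example the leftmost column has no source pixel and is filled with gray, while the other columns carry over the first frame's content. Inaccurate flow would place wrong content in the warped frame in the same way, which is the kind of error Figure 2 says the spatial controls are used to rectify.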