FlowLoss: Dynamic Flow-Conditioned Loss Strategy for Video Diffusion Models

Kuanting Wu, Kei Ota, Asako Kanezaki

TL;DR

FlowLoss tackles temporal instability in Video Diffusion Models by directly aligning dense optical flow fields between generated and ground-truth videos. It introduces a differentiable flow loss $\mathcal{L}_{\text{flow}}$, modulated by a noise-aware gate $w_{\psi}(\sigma)$ and scaled with $\lambda(\sigma) = (\sigma^2+1)/\sigma^2$, integrated into the EDM-based denoising framework. Empirical results on robotic video data show faster early-stage convergence and improved motion stability, though final improvements are modest and depend on gating and flow-estimator quality. The work demonstrates a practical approach to incorporating motion-based supervision into noise-conditioned generative models, with implications for robotics and physics-aware video synthesis.
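
As a rough illustration of how the pieces named above could fit together, the sketch below combines a $\lambda(\sigma)$-weighted reconstruction term with a gated flow term, following the form $\mathcal{L}_{\text{recon}} + w(\sigma) \cdot \mathcal{L}_{\text{flow}}$ given in the figure captions. Several things here are assumptions rather than the authors' implementation: the function names are hypothetical, the gate is modeled as a hard 0/1 cutoff at a threshold `psi`, the loss values are plain floats standing in for tensors, and whether $\lambda(\sigma)$ also multiplies the flow term is not specified in this summary, so it is applied to the reconstruction term only.

```python
def edm_lambda(sigma: float) -> float:
    """Weight lambda(sigma) = (sigma^2 + 1) / sigma^2 from the TL;DR.

    It grows as sigma shrinks, so low-noise steps are weighted more.
    """
    return (sigma ** 2 + 1.0) / sigma ** 2


def hard_gate(sigma: float, psi: float) -> float:
    """Assumed form of w_psi(sigma): 1 at or below the threshold, else 0."""
    return 1.0 if sigma <= psi else 0.0


def combined_loss(mse: float, flow_err: float, sigma: float, psi: float) -> float:
    """L_recon + w_psi(sigma) * L_flow, with L_recon = lambda(sigma) * MSE.

    `mse` and `flow_err` are placeholder scalars; in practice they would be
    computed from generated and ground-truth frames and their flow fields.
    """
    recon = edm_lambda(sigma) * mse
    return recon + hard_gate(sigma, psi) * flow_err
```

With this hard gate, a step sampled at a noise level above `psi` reduces to the reconstruction loss alone, so the flow estimator would not need to run on that step.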

Abstract

Video Diffusion Models (VDMs) can generate high-quality videos, but often struggle with producing temporally coherent motion. Optical flow supervision is a promising approach to address this, with prior works commonly employing warping-based strategies that avoid explicit flow matching. In this work, we explore an alternative formulation, FlowLoss, which directly compares flow fields extracted from generated and ground-truth videos. To account for the unreliability of flow estimation under high-noise conditions in diffusion, we propose a noise-aware weighting scheme that modulates the flow loss across denoising steps. Experiments on robotic video datasets suggest that FlowLoss improves motion stability and accelerates convergence in early training stages. Our findings offer practical insights for incorporating motion-based supervision into noise-conditioned generative models.

Paper Structure

This paper contains 6 sections, 5 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Overview of the FlowLoss supervision framework. The model is supervised by two gradient flows: one from pixel-level reconstruction $\mathcal{L}_\text{recon}$ and one from optical flow consistency $\mathcal{L}_\text{flow}$, which is adjusted by a dynamic weighting $w(\sigma)$, enabling it to generate visually plausible videos with coherent motion dynamics.
  • Figure 2: Panels (a) and (b) show $\mathcal{L}_{\text{recon}}$ and $\mathcal{L}_{\text{flow}}$ computed on a single validation sample, using a VDM built upon the UNet backbone from https://huggingface.co/stabilityai/stable-video-diffusion-img2vid. (a) The original EDM defines the reconstruction loss as $\mathcal{L}_{\text{recon}} = \lambda(\sigma) \cdot \mathcal{L}_{\text{MSE}}$, where $\lambda(\sigma)$ increases as $\sigma$ decreases, encouraging fine-detail reconstruction during low-noise steps. (b) Variants of our loss function. (c) Corresponding weighting strategies. The $w_\psi(\sigma)$ function clips off $\mathcal{L}_{\text{flow}}$ contributions entirely when $\sigma$ exceeds a threshold, balancing cost and supervision strength. (d) Distribution of sampled $\sigma$ values using EDM’s noise prior. Dashed lines show $\psi$ thresholds; percentages indicate the portion of steps where $\mathcal{L}_{\text{flow}}$ is applied under $w_\psi(\sigma)$.
  • Figure 3: Higher noise scales $\sigma$ lead to corrupted inputs and degraded flow extraction, motivating our noise-aware flow loss design.
  • Figure 4: Validation performance over training steps for different flow loss strategies.
  • Figure 5: Comparison of early-stage generation results (step = 100) across different training objectives. Ground Truth refers to the original robot video sequence. Ours uses the full loss function $\mathcal{L}_{\text{recon}} + w(\sigma) \cdot \mathcal{L}_{\text{flow}}$, while Recon Only trains with reconstruction loss $\mathcal{L}_{\text{recon}}$ alone. Our method exhibits improved motion stability and temporal coherence at an early diffusion step, whereas the baseline suffers from spatial drift and jitter.
  • ...and 1 more figures
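
The percentages mentioned in the Figure 2(d) caption can be reasoned about in closed form if one assumes the noise prior is the log-normal used in EDM, $\ln \sigma \sim \mathcal{N}(P_\text{mean}, P_\text{std}^2)$, and that the gate applies the flow loss whenever $\sigma \le \psi$. Under those assumptions the fraction of training steps that receive flow supervision is the normal CDF evaluated at $(\ln \psi - P_\text{mean}) / P_\text{std}$. The sketch below computes it; the function name and the example parameter values (the EDM defaults of -1.2 and 1.2) are illustrative and not taken from this paper.

```python
import math


def gated_fraction(psi: float, p_mean: float, p_std: float) -> float:
    """Probability that sigma <= psi when ln(sigma) ~ Normal(p_mean, p_std^2).

    Assumes a hard gate that enables the flow loss only for sigma <= psi.
    """
    z = (math.log(psi) - p_mean) / p_std
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    # Illustrative values only: EDM's default prior parameters, not this paper's.
    for psi in (0.1, 0.5, 1.0, 5.0):
        share = gated_fraction(psi, p_mean=-1.2, p_std=1.2)
        print(f"psi = {psi}: flow loss active on {share:.1%} of sampled steps")
```

Raising `psi` increases the share of steps with flow supervision, and with it the cost of running the flow estimator, which is the trade-off between cost and supervision strength that the caption describes.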