FlowLoss: Dynamic Flow-Conditioned Loss Strategy for Video Diffusion Models
Kuanting Wu, Kei Ota, Asako Kanezaki
TL;DR
FlowLoss tackles temporal instability in Video Diffusion Models by directly aligning dense optical flow fields between generated and ground-truth videos. It introduces a differentiable flow loss $\mathcal{L}_{\text{flow}}$, modulated by a noise-aware gate $w_{\psi}(\sigma)$ and scaled by $\lambda(\sigma) = (\sigma^2+1)/\sigma^2$, and integrates it into the EDM-based denoising framework. Empirical results on robotic video data show faster early-stage convergence and improved motion stability, though the final improvements are modest and depend on the gating design and the quality of the flow estimator. The work demonstrates a practical approach to incorporating motion-based supervision into noise-conditioned generative models, with implications for robotics and physics-aware video synthesis.
Abstract
Video Diffusion Models (VDMs) can generate high-quality videos, but often struggle to produce temporally coherent motion. Optical flow supervision is a promising way to address this, and prior works commonly employ warping-based strategies that avoid explicit flow matching. In this work, we explore an alternative formulation, FlowLoss, which directly compares flow fields extracted from generated and ground-truth videos. Because flow estimation is unreliable under the high-noise conditions of diffusion, we propose a noise-aware weighting scheme that modulates the flow loss across denoising steps. Experiments on robotic video datasets suggest that FlowLoss improves motion stability and accelerates convergence in early training stages. Our findings offer practical insights for incorporating motion-based supervision into noise-conditioned generative models.
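
The noise-aware weighting can be illustrated with a minimal sketch. Only the scaling $\lambda(\sigma) = (\sigma^2+1)/\sigma^2$ is taken from the TL;DR; it has the same form as the EDM loss weight with $\sigma_{\text{data}} = 1$. Everything else is an assumption made for illustration: the sigmoid form of the gate and its `center` and `sharpness` parameters, the L1 distance between flow fields, the multiplicative combination of the three factors, and all function names. The flow fields are passed in as plain lists of `(u, v)` vectors, so no flow estimator is involved here.

```python
import math


def edm_lambda(sigma):
    """Scaling from the TL;DR: lambda(sigma) = (sigma^2 + 1) / sigma^2."""
    return (sigma ** 2 + 1.0) / sigma ** 2


def noise_gate(sigma, center=1.0, sharpness=4.0):
    """Hypothetical gate w(sigma): near 1 at low noise, near 0 at high noise.

    The sigmoid-in-log-sigma form and both parameters are assumptions,
    not the gate used in the paper.
    """
    return 1.0 / (1.0 + math.exp(sharpness * (math.log(sigma) - math.log(center))))


def flow_l1(flow_pred, flow_true):
    """Mean absolute difference between two flow fields.

    Each field is a list of (u, v) displacement vectors of equal length.
    The L1 distance is an assumed choice of flow-matching metric.
    """
    total = sum(
        abs(up - ut) + abs(vp - vt)
        for (up, vp), (ut, vt) in zip(flow_pred, flow_true)
    )
    return total / (2 * len(flow_true))


def weighted_flow_loss(flow_pred, flow_true, sigma):
    """Assumed combination: lambda(sigma) * w(sigma) * L_flow."""
    return edm_lambda(sigma) * noise_gate(sigma) * flow_l1(flow_pred, flow_true)


if __name__ == "__main__":
    pred = [(1.0, 0.0), (0.0, 1.0)]
    true = [(0.0, 0.0), (0.0, 0.0)]
    for sigma in (0.1, 1.0, 10.0):
        print(sigma, weighted_flow_loss(pred, true, sigma))
```

With this assumed gate, the flow term contributes most at low noise levels, where the denoised frames are clean enough for flow estimation to be meaningful, and is suppressed at high noise levels, which matches the motivation stated in the abstract.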
