Unified Dense Prediction of Video Diffusion
Lehan Yang, Lu Qi, Xiangtai Li, Sheng Li, Varun Jampani, Ming-Hsuan Yang
TL;DR
UDP Diffusion (UDPDiff) generates coherent video while jointly predicting dense scene properties, using Pixelplanes as a unified dense representation for segmentation and depth. The method fuses video generation and dense prediction in a single diffusion framework, guided by a learnable task embedding and a multi-task objective, and is built on a 3D VAE backbone with a Transformer denoiser. A new dataset, Panda-Dense, provides large-scale captioned video with segmentation and depth annotations for multi-task training. Empirical results show improved video quality, temporal consistency, and motion smoothness; joint multi-task training delivers additional gains with minimal inference overhead, which suggests potential for editing and other downstream tasks. The work also presents a data-and-model design intended to scale dense-prediction-guided video generation to real-world applications.
Abstract
We present a unified network that simultaneously generates videos and their corresponding entity segmentation and depth maps from text prompts. We use colormaps to represent entity masks and depth maps, tightly integrating dense prediction with RGB video generation. Introducing dense prediction information improves the consistency and motion smoothness of generated videos without increasing computational cost. Learnable task embeddings bring multiple dense prediction tasks into a single model, enhancing flexibility and further boosting performance. We further propose a large-scale dense prediction video dataset, Panda-Dense, addressing the issue that no existing dataset simultaneously contains captions, videos, segmentation, and depth maps. Comprehensive experiments demonstrate the high efficiency of our method, which surpasses the state of the art in video quality, consistency, and motion smoothness.
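As an illustration of the colormap idea, the sketch below shows one way an entity mask and a depth map could each be turned into an RGB image, so that both share the same pixel format as a video frame. The abstract does not specify the palette or normalization, so the golden-ratio hue stepping, the black background for ID 0, the min-max depth scaling, and the function names `entity_ids_to_colormap` and `depth_to_colormap` are all assumptions made for this example, not the authors' implementation.

```python
import colorsys


def entity_ids_to_colormap(mask):
    """Map each integer entity ID in a 2D mask to an RGB triple in [0, 255].

    Assumption: ID 0 is background and is rendered black.
    """
    def color_for(entity_id):
        if entity_id == 0:
            return (0, 0, 0)
        # Golden-ratio hue stepping spreads consecutive IDs around the hue wheel.
        hue = (entity_id * 0.61803398875) % 1.0
        r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
        return (round(r * 255), round(g * 255), round(b * 255))

    return [[color_for(v) for v in row] for row in mask]


def depth_to_colormap(depth):
    """Min-max normalize a 2D depth map to [0, 255] and replicate it over RGB."""
    flat = [v for row in depth for v in row]
    lo, hi = min(flat), max(flat)
    # A constant depth map has no range; map it to all zeros.
    scale = 255.0 / (hi - lo) if hi > lo else 0.0

    def gray(v):
        g = round((v - lo) * scale)
        return (g, g, g)

    return [[gray(v) for v in row] for row in depth]


if __name__ == "__main__":
    mask = [[0, 1], [1, 2]]              # toy 2x2 entity mask
    depth = [[0.0, 4.0], [10.0, 1.0]]    # toy 2x2 depth map
    print(entity_ids_to_colormap(mask))
    print(depth_to_colormap(depth))
```

Because both outputs are ordinary three-channel images, they could in principle pass through the same encoder as RGB frames, which is the property the abstract relies on when it describes tight integration with video generation.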
