Unified Dense Prediction of Video Diffusion

Lehan Yang, Lu Qi, Xiangtai Li, Sheng Li, Varun Jampani, Ming-Hsuan Yang

TL;DR

UDP Diffusion (UDPDiff) addresses the challenge of generating coherent video content while jointly predicting dense scene properties by introducing Pixelplanes as a unified dense representation for segmentation and depth. The method fuses video generation and dense prediction in a single diffusion framework, guided by a learnable task embedding and a multi-task objective, built on a 3D VAE backbone and a Transformer denoiser. A new Panda-Dense dataset provides large-scale, captioned video data with segmentation and depth annotations to support training, enabling effective multi-task learning. Empirical results show improved video quality, temporal consistency, and motion smoothness, with joint multi-task training delivering additional gains and minimal inference overhead, signaling strong potential for editing and downstream tasks. The work also demonstrates a practical data-and-model design that can scale dense-prediction-guided video generation in real-world applications.

Abstract

We present a unified network that simultaneously generates videos and their corresponding entity segmentation and depth maps from text prompts. We use colormaps to represent entity masks and depth maps, tightly integrating dense prediction with RGB video generation. Introducing dense prediction information improves the consistency and motion smoothness of generated videos without increasing computational cost. Incorporating learnable task embeddings brings multiple dense prediction tasks into a single model, enhancing flexibility and further boosting performance. We further propose Panda-Dense, a large-scale dense prediction video dataset, addressing the issue that existing datasets do not concurrently contain captions, videos, segmentation, and depth maps. Comprehensive experiments demonstrate the high efficiency of our method, which surpasses the state of the art in video quality, consistency, and motion smoothness.
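To make the colormap idea concrete, the sketch below shows one way an entity-ID mask and a depth map could each be turned into an RGB frame, so that a video VAE can treat them like ordinary video. This is an illustration only: the function names (`entity_color`, `mask_to_rgb`, `depth_to_rgb`), the golden-ratio hue assignment, and the grayscale depth mapping are assumptions made here, not the encoding used by the authors.

```python
import colorsys

def entity_color(entity_id):
    """Map an integer entity ID to an RGB colour; ID 0 is treated as background.

    Hypothetical scheme: step the hue by the golden ratio so that
    consecutive IDs land far apart on the hue wheel.
    """
    if entity_id == 0:
        return (0, 0, 0)
    hue = (entity_id * 0.61803398875) % 1.0
    r, g, b = colorsys.hsv_to_rgb(hue, 0.9, 1.0)
    return (round(r * 255), round(g * 255), round(b * 255))

def mask_to_rgb(mask):
    """Convert a 2D grid of entity IDs into a 2D grid of RGB tuples."""
    return [[entity_color(e) for e in row] for row in mask]

def depth_to_rgb(depth):
    """Min-max normalise a 2D depth grid and write it as grayscale RGB.

    Assumption: a plain grayscale ramp; any fixed colormap would also do.
    """
    flat = [d for row in depth for d in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1.0  # avoid division by zero on constant depth
    def shade(d):
        v = round(255 * (d - lo) / scale)
        return (v, v, v)
    return [[shade(d) for d in row] for row in depth]

# One tiny 2x2 "frame" of each modality.
seg_frame = mask_to_rgb([[0, 1], [1, 2]])
depth_frame = depth_to_rgb([[0.0, 1.0], [0.5, 2.0]])
```

Applying such a mapping per frame yields a three-channel sequence with the same shape as the RGB video, which is what lets a single encoder and decoder serve both generation and dense prediction.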

Paper Structure

This paper contains 19 sections, 2 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: Visualization results of the multi-task UDPDiff model on video generation and dense prediction. Our model can generate a video together with its corresponding dense estimation. We incorporate two tasks in one multi-task model: video entity segmentation and video depth estimation. Both segmentation and depth maps are encoded into RGB format as a video sequence, using Pixelplanes.
  • Figure 2: Overview of the Panda-Dense pipeline and the UDPDiff framework. Left: For segmentation, we use the first frame's results from EntitySeg as a prompt for SAM2, which then performs video segmentation across the entire sequence. For depth estimation, DepthCrafter is used to generate video depth maps. For long prompts, Video LLaVA is used for captioning. Right: Similar to CogVideoX, our method UDPDiff denoises the feature sequence in the latent space, encoding and decoding the latent using a 3D VAE. Video generation and dense prediction share a similar paradigm, using the same VAE for encoding and decoding through a unified representation. Task embeddings are applied to the time step embeddings, enabling more powerful differentiation of various tasks under a multi-task joint training model.
  • Figure 3: Consistency qualitative comparison between CogVideoX and UDPDiff. Six frames are evenly sampled from the generated video, with the horizontal axis representing time and the frame index gradually increasing. Inconsistent parts are annotated with red bounding boxes, including disappearances, color changes, and shape changes.
  • Figure 4: Quality qualitative comparison between CogVideoX and UDPDiff. Six frames are evenly sampled from the generated video, with the horizontal axis representing time and the frame index gradually increasing. Our advantages are reflected in clearer and sharper entities, more realistic motion, and better generation of dense scenes.
  • Figure 5: Visualization of the video generation associated with dense prediction. For each sample, we evenly sample four frames, with the left column of each sample representing the generated video and the right column representing the generated dense prediction. The left half of the samples is video entity segmentation, and the right half is video depth estimation. Our model can generate high-quality, dense predictions simultaneously with almost no increase in computational cost.
  • ...and 6 more figures
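
The Figure 2 caption states that task embeddings are applied to the time step embeddings. A minimal sketch of that conditioning step follows, assuming that "applied" means element-wise addition of a per-task vector to a sinusoidal timestep embedding. The dimension `DIM`, the task names in `TASKS`, the random initialisation of `task_table`, and both function names are hypothetical; in a real model the task vectors would be trained parameters.

```python
import math
import random

DIM = 8  # hypothetical embedding width, kept small for readability
TASKS = {"segmentation": 0, "depth": 1}  # assumed task vocabulary

def timestep_embedding(t, dim=DIM):
    """Standard sinusoidal embedding of a diffusion timestep t."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

# Stand-in for a learnable embedding table: one vector per task.
# Here it is only randomly initialised, with a fixed seed for repeatability.
rng = random.Random(0)
task_table = [[rng.gauss(0.0, 0.02) for _ in range(DIM)] for _ in TASKS]

def conditioned_embedding(t, task):
    """Add the task vector to the timestep embedding (assumed combination)."""
    t_emb = timestep_embedding(t)
    task_emb = task_table[TASKS[task]]
    return [a + b for a, b in zip(t_emb, task_emb)]

seg_cond = conditioned_embedding(10, "segmentation")
depth_cond = conditioned_embedding(10, "depth")
```

Because the two tasks share the same timestep embedding and differ only by the added task vector, the denoiser receives a distinct conditioning signal per task while every other weight stays shared.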