CPA: Camera-pose-awareness Diffusion Transformer for Video Generation

Yuelei Wang, Jian Zhang, Pengtao Jiang, Hao Zhang, Jinwei Chen, Bo Li

TL;DR

The paper addresses the challenge of precise camera-pose control in diffusion-transformer video generation. It introduces CPA, which encodes per-frame camera poses as a $12$-dimensional motion representation via Plücker coordinates, converts them into a sparse motion field with Sparse Motion Encoding (SME), and injects the resulting pose latent into temporal attention through Temporal Attention Injection (TAI), guided by a pose latent VAE. Through a two-stage training regime on RealEstate10K and careful fine-tuning of OpenSora, CPA achieves state-of-the-art performance for long-video generation, improving trajectory fidelity and object consistency while preserving high visual quality. This camera-pose-aware diffusion framework enables flexible, controllable video synthesis with potential applications in creative, AR/VR, and cinematic contexts.

Abstract

Despite the significant advances made by Diffusion Transformer (DiT)-based methods in video generation, a notable gap remains in controllable camera poses. Existing works such as OpenSora do not adhere precisely to the intended trajectories and physical interactions, which limits their flexibility in downstream applications. To alleviate this issue, we introduce CPA, a unified camera-pose-awareness text-to-video generation approach that explicitly represents camera movement and integrates textual, visual, and spatial conditions. Specifically, we deploy the Sparse Motion Encoding (SME) module to transform camera pose information into a spatial-temporal embedding, and the Temporal Attention Injection (TAI) module to inject motion patches into each ST-DiT block. Our plug-in architecture accommodates the original DiT parameters, supporting diverse types of camera poses and flexible object movement. Extensive qualitative and quantitative experiments demonstrate that our method outperforms LDM-based methods for long video generation while achieving the best trajectory consistency and object consistency in our comparisons.
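The Plücker-coordinate encoding mentioned above can be illustrated with a minimal sketch. It computes the standard 6-D Plücker form (moment, direction) of the ray through each pixel; this is the textbook construction, not necessarily the exact representation used in CPA. The function names `plucker_ray` and `plucker_embedding`, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and the world-to-camera convention for `R` and `t` are assumptions made for illustration.

```python
import math


def plucker_ray(u, v, fx, fy, cx, cy, R, t):
    """Return the 6-D Plücker coordinates (o x d, d) of the ray through pixel (u, v).

    Assumption: R (3x3 nested lists) and t (length 3) are world-to-camera
    extrinsics, i.e. x_cam = R @ x_world + t.
    """
    # Back-project the pixel to a direction in camera coordinates: K^{-1} [u, v, 1]^T.
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]
    # Rotate the direction into world coordinates with R^T and normalise it.
    d = [sum(R[k][i] * d_cam[k] for k in range(3)) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d))
    d = [c / norm for c in d]
    # Camera centre in world coordinates: o = -R^T t.
    o = [-sum(R[k][i] * t[k] for k in range(3)) for i in range(3)]
    # Moment vector m = o x d.
    m = [
        o[1] * d[2] - o[2] * d[1],
        o[2] * d[0] - o[0] * d[2],
        o[0] * d[1] - o[1] * d[0],
    ]
    return m + d


def plucker_embedding(height, width, fx, fy, cx, cy, R, t):
    """Per-pixel Plücker map for one frame, indexed as [row][column][6].

    Rays are cast through pixel centres (x + 0.5, y + 0.5).
    """
    return [
        [plucker_ray(x + 0.5, y + 0.5, fx, fy, cx, cy, R, t) for x in range(width)]
        for y in range(height)
    ]
```

Stacking one such map per frame gives a tensor aligned with the video's spatial and temporal layout, which is the property that makes this kind of encoding convenient for conditioning a video model.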

Paper Structure

This paper contains 16 sections, 11 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: The relevance of this work to video generation models. (a) The DiT-based video generation model leverages DiT blocks to produce high-quality videos. (b) CPA utilizes Plücker coordinates encoded with camera pose information and aligns with the attention mechanism in the DiT block to generate camera-oriented videos.
  • Figure 2: The overview of CPA. CPA includes the Sparse Motion Encoding (SME) Module and the Temporal Attention Injection (TAI) Module. It builds a sparse motion sequence representation based on Plücker coordinates and feeds it into the VAE to obtain the pose latent, handling camera pose sequences across multiple frames. Layer normalization and an MLP align the temporal attention layer with the pose latent. The video and text caption inputs are the same as in OpenSora: the video is fed to the ST-DiT through the 3D-VAE, and the caption is fed to the cross-attention layers through the T5 model.
  • Figure 3: The pipeline of camera pose sequence encoding. The matrix parameters between adjacent frames are calculated to obtain the camera pose sequence, which is then transformed into RGB space through the sparse motion field and finally encoded into a pose latent by the VAE.
  • Figure 4: Temporal Attention Injection Module. Layer normalization (LN) is applied to the temporal attention features and a multi-layer perceptron (MLP) to the pose latent, so that the two are aligned.
  • Figure 5: A visualization of camera pose series. We visualize the image sequence after sparse motion sampling. Each row shows frame 0, frame 5, frame 10, and frame 15 (the final frame) of one camera pose series, from left to right. The arrows in each image indicate the motion of the sampling points. The first row shows a camera zoom-in motion, and the second row shows a pan-right motion.
  • ...and 4 more figures
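
The Figure 3 caption mentions computing matrix parameters between adjacent frames. One common way to do this is to take the relative rigid transform between consecutive camera poses, sketched below. The helper names `relative_pose` and `adjacent_relative_poses` are hypothetical, and the sketch assumes each pose is a world-to-camera pair `(R, t)`; the paper's actual parameterization may differ.

```python
def relative_pose(R_a, t_a, R_b, t_b):
    """Transform mapping frame-a camera coordinates to frame-b camera coordinates.

    Assumption: both poses are world-to-camera, x_cam = R @ x_world + t, so
    x_b = R_rel @ x_a + t_rel with R_rel = R_b R_a^T and t_rel = t_b - R_rel t_a.
    """
    # R_rel[i][j] = sum_k R_b[i][k] * R_a[j][k], i.e. R_b @ R_a^T.
    R_rel = [
        [sum(R_b[i][k] * R_a[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]
    # t_rel = t_b - R_rel @ t_a.
    t_rel = [t_b[i] - sum(R_rel[i][k] * t_a[k] for k in range(3)) for i in range(3)]
    return R_rel, t_rel


def adjacent_relative_poses(poses):
    """Given a list of (R, t) per frame, return the len(poses) - 1 adjacent relative poses."""
    return [
        relative_pose(*poses[i], *poses[i + 1]) for i in range(len(poses) - 1)
    ]
```

For a 16-frame clip this yields 15 relative transforms, one per pair of neighbouring frames.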