CETCAM: Camera-Controllable Video Generation via Consistent and Extensible Tokenization

Zelin Zhao, Xinyu Gong, Bangya Liu, Ziyang Song, Jun Zhang, Suhui Wu, Yongxin Chen, Hao Zhang

TL;DR

CETCam addresses camera-controlled video generation without pose annotations by introducing a geometry-aware tokenization pipeline that uses depth and pose estimates from VGGT to produce renderings, masks, and camera embeddings. These CETCam tokens are integrated into a frozen diffusion backbone through lightweight CETCam context blocks, enabling plug-and-play conditioning with other modalities. Training proceeds in two phases, broad learning from diverse raw videos followed by fine-tuning on high-fidelity data, and yields state-of-the-art geometric consistency, temporal stability, and visual realism across benchmarks. CETCam is also extensible, as demonstrated by its integration with VACE for additional controls such as inpainting and layout, which extends controllable video generation beyond camera motion.

Abstract

Achieving precise camera control in video generation remains challenging, as existing methods often rely on camera pose annotations that are difficult to scale to large and dynamic datasets and are frequently inconsistent with depth estimation, leading to train-test discrepancies. We introduce CETCAM, a camera-controllable video generation framework that eliminates the need for camera annotations through a consistent and extensible tokenization scheme. CETCAM leverages recent advances in geometry foundation models, such as VGGT, to estimate depth and camera parameters and converts them into unified, geometry-aware tokens. These tokens are seamlessly integrated into a pretrained video diffusion backbone via lightweight context blocks. Trained in two progressive stages, CETCAM first learns robust camera controllability from diverse raw video data and then refines fine-grained visual quality using curated high-fidelity datasets. Extensive experiments across multiple benchmarks demonstrate state-of-the-art geometric consistency, temporal stability, and visual realism. Moreover, CETCAM exhibits strong adaptability to additional control modalities, including inpainting and layout control, highlighting its flexibility beyond camera control. The project page is available at https://sjtuytc.github.io/CETCam_project_page.github.io/.

Paper Structure

This paper contains 34 sections, 12 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Camera-controlled Video Generation. Our framework synthesizes dynamic, geometry-consistent scenes across diverse domains, including human faces, animal motion, and natural environments. It allows accurate camera control across wide viewpoints, with a full 360° orbit shown in the last example. Please refer to the project page for videos, code, and pre-trained models.
  • Figure 2: Overview of the CETCam I2V generation framework. (a) CETCam Tokenizer. Given an in-the-wild training video or a test-time input frame, the frames are processed by VGGT [vggt] to predict depth maps. During training, camera poses are also estimated. The predicted depths and camera poses are used for point cloud reprojection, which generates renderings of the first frame and the corresponding masks. Renderings, masks, and camera poses are embedded and fused to produce CETCam tokens. (b) Token-Based Controlled Video Generation. The generator combines several token types with distinct functions: CETCam tokens, noisy latents, VCU tokens, and VACE tokens [vace]. CETCam tokens are consumed by learnable CETCam context blocks, which are connected to the pre-trained Wan DiT blocks [wan2025] through zero linear layers and add operations [cao2025uni3c]. The other tokens are processed by VACE context blocks and Wan DiT blocks. Finally, the output tokens are decoded by a 3D VAE to generate a video [wan2025]. More details can be found in the tokenizer and controlled-generation sections.
  • Figure 3: Comparison with the closest concurrent work, Uni3C [cao2025uni3c]. Left: Uni3C renderings fail to accurately follow the intended camera motion and exhibit geometric distortions and outliers due to inconsistent 3D estimation. Right: These rendering inaccuracies lead to spatial misalignment and visible artifacts in the videos generated by Uni3C, whereas our generated videos do not exhibit these artifacts.
  • Figure 4: Camera control results. Top: Four videos generated under the same camera trajectory (source images omitted for brevity). Bottom: Results with the same source image but different camera trajectories, illustrating consistent control across four distinct motions.
  • Figure 5: Extensibility Results. We visualize results that control camera motion while applying additional controls through VACE [vace]. (a) Object replacement (with the prompt “replace the dragon with a phoenix”) using a source image and a masked image as VACE control, (b) recoloring of a green portrait guided by a gray image control, (c) realization of a virtual scene from a scribble as VACE control, and (d) object swap guided by a provided reference image. Please refer to the supplementary material for more demo videos.
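
The point cloud reprojection step described in the Figure 2 caption can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `reproject_first_frame`, the pinhole intrinsics `K` shared by both views, the source-to-target transform `T_src_to_tgt`, and the nearest-pixel splatting with a simple z-buffer are all assumptions made for the example. In CETCam the depth and camera poses are estimated by VGGT; here they are simply passed in as arrays.

```python
import numpy as np


def reproject_first_frame(image, depth, K, T_src_to_tgt):
    """Forward-warp the first frame into a target camera (illustrative sketch).

    image:         (H, W, 3) colors of the source (first) frame.
    depth:         (H, W) depth along the source camera's z-axis.
    K:             (3, 3) pinhole intrinsics, assumed shared by both views.
    T_src_to_tgt:  (4, 4) rigid transform from source-camera to
                   target-camera coordinates.
    Returns (rendering, mask): rendering is (H, W, 3); mask is (H, W) bool,
    True where a reprojected point landed.
    """
    H, W = depth.shape

    # Homogeneous pixel coordinates (u, v, 1) for every pixel.
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Unproject to a point cloud in the source frame: X = depth * K^-1 [u, v, 1]^T.
    pts_src = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)

    # Move the points into the target camera frame.
    pts_tgt = pts_src @ T_src_to_tgt[:3, :3].T + T_src_to_tgt[:3, 3]

    # Perspective projection; discard points behind the camera or off-screen.
    z = pts_tgt[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)
    proj = pts_tgt @ K.T
    u_t = np.rint(proj[:, 0] / safe_z).astype(int)
    v_t = np.rint(proj[:, 1] / safe_z).astype(int)
    valid = in_front & (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)

    # Z-buffer: sort valid points nearest-first, then keep the first point
    # that falls on each target pixel.
    idx = np.flatnonzero(valid)
    idx = idx[np.argsort(z[idx], kind="stable")]
    flat = v_t[idx] * W + u_t[idx]
    _, first = np.unique(flat, return_index=True)
    keep = idx[first]

    colors = image.reshape(-1, 3)
    rendering = np.zeros_like(image)
    mask = np.zeros((H, W), dtype=bool)
    rendering[v_t[keep], u_t[keep]] = colors[keep]
    mask[v_t[keep], u_t[keep]] = True
    return rendering, mask
```

With an identity transform the rendering reproduces the input frame and the mask is fully set. Once the camera moves, target pixels that receive no point stay False in the mask, which presumably marks the regions the video model has to synthesize.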