GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography

Mengchen Zhang, Tong Wu, Jing Tan, Ziwei Liu, Gordon Wetzstein, Dahua Lin

TL;DR

This work tackles expressive camera trajectory generation by introducing DataDoP, a large multi-modal dataset of free-moving camera paths with depth and directorial captions, and GenDoP, an auto-regressive Transformer that generates 3D camera trajectories conditioned on text and RGBD inputs. The method discretizes camera poses into tokens and uses a multi-modal encoder to guide a decoder in producing coherent, intent-aligned trajectories. Quantitative and qualitative evaluations show that GenDoP achieves better text-trajectory alignment, trajectory quality, and robustness than existing baselines, including diffusion-based and object/scene-centric approaches. The approach enables fine-grained control for text-guided cinematography and paves the way for advanced, AI-assisted camera control in video generation pipelines.
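To make the idea of turning camera poses into discrete tokens concrete, here is a minimal sketch of uniform binning. It is an illustration only, not the paper's tokenizer: the bin count `NUM_BINS`, the value range, and the function names `tokenize` and `detokenize` are all assumptions introduced for this example.

```python
import numpy as np

NUM_BINS = 256  # hypothetical number of bins; not taken from the paper


def tokenize(values, low, high, num_bins=NUM_BINS):
    """Map continuous values in [low, high] to integer tokens in [0, num_bins - 1]."""
    values = np.clip(np.asarray(values, dtype=np.float64), low, high)
    scaled = (values - low) / (high - low)  # rescale to [0, 1]
    # The upper edge (scaled == 1) would land in bin num_bins, so cap it.
    return np.minimum((scaled * num_bins).astype(np.int64), num_bins - 1)


def detokenize(tokens, low, high, num_bins=NUM_BINS):
    """Map tokens back to the centre of their bin."""
    tokens = np.asarray(tokens, dtype=np.float64)
    return low + (tokens + 0.5) / num_bins * (high - low)


# Example: one camera translation, assumed to be normalised to [-1, 1].
translation = [-1.0, 0.0, 1.0]
tokens = tokenize(translation, low=-1.0, high=1.0)
recovered = detokenize(tokens, low=-1.0, high=1.0)
```

With uniform bins, the round-trip error is bounded by half a bin width, so a finer vocabulary trades longer token ranges for more precise poses.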

Abstract

Camera trajectory design plays a crucial role in video production, serving as a fundamental tool for conveying directorial intent and enhancing visual storytelling. In cinematography, Directors of Photography meticulously craft camera movements to achieve expressive and intentional framing. However, existing methods for camera trajectory generation remain limited: Traditional approaches rely on geometric optimization or handcrafted procedural systems, while recent learning-based methods often inherit structural biases or lack textual alignment, constraining creative synthesis. In this work, we introduce an auto-regressive model inspired by the expertise of Directors of Photography to generate artistic and expressive camera trajectories. We first introduce DataDoP, a large-scale multi-modal dataset containing 29K real-world shots with free-moving camera trajectories, depth maps, and detailed captions in specific movements, interaction with the scene, and directorial intent. Thanks to the comprehensive and diverse database, we further train an auto-regressive, decoder-only Transformer for high-quality, context-aware camera movement generation based on text guidance and RGBD inputs, named GenDoP. Extensive experiments demonstrate that compared to existing methods, GenDoP offers better controllability, finer-grained trajectory adjustments, and higher motion stability. We believe our approach establishes a new standard for learning-based cinematography, paving the way for future advancements in camera control and filmmaking. Our project website: https://kszpxxzmc.github.io/GenDoP/.

Paper Structure

This paper contains 24 sections, 3 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Overview. Top: DataDoP data construction. Given RGB video frames, we extract RGBD images and camera poses, then tag the pose sequence with different motion categories (shown in different colors). Using an LLM, we generate two types of captions from the motion tags and RGBD inputs: the Motion Caption describes the camera movements, while the Directorial Caption describes the camera movements along with their interaction with the scene and the directorial intent. Bottom: Our GenDoP method supports multi-modal inputs for trajectory creation. The generated camera sequence can be easily applied to various video generation tasks, including text-to-video (T2V) generation (CameraCtrl) and image-to-video (I2V) generation (CamTrol). GenDoP paves the way for future advancements in camera-controlled video generation.
  • Figure 2: Dataset Statistics. (a) The figure illustrates the composition and distribution of 27 translation motions (left) and 7 rotation motions (right), emphasizing the complexity and diversity of trajectories in our DataDoP dataset. (b) Based on the same caption, our dataset includes diverse trajectories that still conform to the given caption. As shown in the figure, the trajectories exhibit variations in terms of length, direction, and speed, effectively showcasing the diversity within our dataset.
  • Figure 3: Our Auto-regressive Generation Model. Our model supports multi-modal inputs and generates trajectories based on these inputs. By treating the task as an auto-regressive next-token prediction problem, the model sequentially generates trajectories, with each new pose prediction influenced by previous camera states and input conditions.
  • Figure 4: Qualitative Results of Text-conditioned Trajectory Generation. We offer a comparative analysis of text-conditioned trajectory generation in the figure. Our model's trajectories (color-coded to highlight text alignment) remain stable and closely follow the instructions, while other models exhibit significant jitter or fail to match the instructions well.
  • Figure 5: Qualitative Results of RGBD & Text-conditioned Generation. This figure compares the impact of incorporating RGBD input on trajectory generation under identical text conditions. While both models generate command-compliant trajectories, the RGBD & Text-conditioned model demonstrates superior scene adaptation by utilizing RGBD data to integrate geometric and contextual constraints.
  • ...and 3 more figures
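
The next-token formulation described in the Figure 3 caption can be sketched as a simple decoding loop. This is a toy illustration under stated assumptions, not the GenDoP implementation: `generate` uses greedy decoding (the paper's sampling strategy is not stated here), and `toy_logits` is a hypothetical stand-in for a trained Transformer that depends only on the last token and an integer condition.

```python
import numpy as np


def generate(next_token_logits, condition, num_tokens, bos_token=0):
    """Greedy auto-regressive decoding: each new token depends on the
    input condition and on all previously generated tokens."""
    tokens = [bos_token]
    for _ in range(num_tokens):
        logits = next_token_logits(condition, tokens)
        tokens.append(int(np.argmax(logits)))
    return tokens[1:]  # drop the start token


def toy_logits(condition, tokens, vocab_size=8):
    """Hypothetical stand-in for a trained model: it favours
    (last token + condition) modulo the vocabulary size."""
    logits = np.zeros(vocab_size)
    logits[(tokens[-1] + condition) % vocab_size] = 1.0
    return logits


sequence = generate(toy_logits, condition=3, num_tokens=5)
print(sequence)  # [3, 6, 1, 4, 7]
```

In a real system the condition would be the encoded text and RGBD inputs, and the generated tokens would be mapped back to camera poses after decoding.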