E.T. the Exceptional Trajectories: Text-to-camera-trajectory generation with character awareness

Robin Courant, Nicolas Dufour, Xi Wang, Marc Christie, Vicky Kalogeiton

TL;DR

This work tackles cinematic camera-trajectory generation by introducing the Exceptional Trajectories (E.T.) dataset, a large-scale collection of camera and character trajectories extracted from real movies and paired with rich captions. It proposes Director, a diffusion-based model conditioned on character motion and textual descriptions, and CLaTr, a language-trajectory embedding used for evaluation. The results show state-of-the-art performance on both trajectory quality and caption coherence, with ablations highlighting the benefits of cross-attention conditioning. By enabling text-conditioned, character-aware camera generation on real cinematic data, the study aims to make cinematography more accessible and lays groundwork for future work on more expressive captions and precise on-screen character targeting. Together, E.T., Director, and CLaTr form a pipeline for training, generating, and evaluating cinematic camera trajectories.

Abstract

Stories and emotions in movies emerge through well-thought-out directing decisions, in particular camera placement and movement over time. Crafting compelling camera trajectories remains a complex iterative process, even for skilful artists. To tackle this, we propose a dataset called the Exceptional Trajectories (E.T.), which pairs camera trajectories with character information and textual captions describing both camera and character. To our knowledge, this is the first dataset of its kind. To show the potential applications of the E.T. dataset, we propose a diffusion-based approach, named DIRECTOR, which generates complex camera trajectories from textual captions that describe the relation and synchronisation between the camera and characters. To ensure robust and accurate evaluations, we train CLaTr, a Contrastive Language-Trajectory embedding, on the E.T. dataset and use it to compute evaluation metrics. We posit that our proposed dataset and method significantly advance the democratization of cinematography, making it more accessible to common users.
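
The abstract describes CLaTr only as a contrastive language-trajectory embedding and does not state its training objective. As a rough illustration of what "contrastive" usually means in this setting, the sketch below computes a generic symmetric InfoNCE loss (in the style of CLIP) over a batch of paired caption and trajectory embeddings. This is an assumption for illustration, not the loss reported in the paper: the function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(text_emb, traj_emb, temperature=0.1):
    """Generic symmetric InfoNCE over paired embeddings (hypothetical helper).

    text_emb[i] and traj_emb[i] are assumed to describe the same sample.
    The temperature of 0.1 is an arbitrary illustrative choice.
    """
    t = [l2_normalize(v) for v in text_emb]
    c = [l2_normalize(v) for v in traj_emb]
    # Cosine similarity between every caption and every trajectory.
    logits = [
        [sum(a * b for a, b in zip(ti, cj)) / temperature for cj in c]
        for ti in t
    ]
    logits_t = [list(col) for col in zip(*logits)]  # trajectory-to-text view
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy 2-D embeddings: aligned pairs should score a much lower loss than swapped ones.
captions = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(captions, [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss(captions, [[0.0, 1.0], [1.0, 0.0]])
```

With such an objective, a caption and its own trajectory are pulled together in the shared space while mismatched pairs are pushed apart, which is what makes distances in that space usable as an evaluation signal.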

Paper Structure (36 sections, 13 equations, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Different results generated by our camera trajectory diffusion system. Project page https://www.lix.polytechnique.fr/vista/projects/2024_et_courant.
  • Figure 2: Example E.T. samples. Each subfigure presents frames from the original movie shot on the left, while the right side depicts the extracted and processed camera and character trajectories. Additionally, the bottom part showcases the generated camera trajectory caption with or without the character trajectory.
  • Figure 2: Quantitative Results. Comparison of Director and concurrent methods on E.T. pure and mixed subsets, evaluating trajectory quality (left) and caption coherence (right). The best and second-best results are marked.
  • Figure 3: Dataset creation pipeline. Given RGB frames from a video, we first extract and pre-process camera and character poses, then tag resulting camera and character trajectories (sequence of poses) to obtain rough independent descriptions (middle part). Finally, we translate these descriptions into rich textual captions, aligning the camera trajectory with that of the character (right part).
  • Figure 4: E.T. statistics.
  • ...and 7 more figures