E.T. the Exceptional Trajectories: Text-to-camera-trajectory generation with character awareness
Robin Courant, Nicolas Dufour, Xi Wang, Marc Christie, Vicky Kalogeiton
TL;DR
This work tackles cinematic camera-trajectory generation by introducing the Exceptional Trajectories (E.T.) dataset, a large-scale collection of camera and character trajectories from real movies, paired with rich captions. It proposes DIRECTOR, a diffusion-based model conditioned on character motion and textual descriptions, and CLaTr, a contrastive language-trajectory embedding used for evaluation. The results show state-of-the-art performance on both trajectory quality and caption coherence, with ablations highlighting the benefits of cross-attention conditioning. By enabling text-conditioned, character-aware camera generation on real cinematic data, the study aims to make cinematography more accessible and lays a foundation for future work on more expressive captions and more precise on-screen character targeting. Together, E.T., DIRECTOR, and CLaTr form a scalable, multi-modal pipeline for training, generating, and evaluating cinematic camera trajectories.
Abstract
Stories and emotions in movies emerge through well-thought-out directing decisions, in particular camera placement and movement over time. Crafting compelling camera trajectories remains a complex, iterative process, even for skilful artists. To tackle this, we propose a dataset called the Exceptional Trajectories (E.T.), which contains camera trajectories along with character information and textual captions describing both camera and character. To our knowledge, this is the first dataset of its kind. To show the potential applications of the E.T. dataset, we propose a diffusion-based approach, named DIRECTOR, which generates complex camera trajectories from textual captions that describe the relation and synchronisation between the camera and characters. To ensure robust and accurate evaluations, we train CLaTr, a Contrastive Language-Trajectory embedding, on the E.T. dataset and use it to compute evaluation metrics. We posit that our proposed dataset and method significantly advance the democratization of cinematography, making it more accessible to non-expert users.
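To make the idea of a contrastive language-trajectory embedding more concrete, the following is a minimal sketch of a symmetric, CLIP-style contrastive objective over paired caption and trajectory embeddings. The abstract does not specify the loss CLaTr uses, so the formulation, the function names, and the temperature value are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def symmetric_contrastive_loss(text_emb, traj_emb, temperature=0.07):
    # Hypothetical objective: row i of text_emb is assumed to be the caption
    # embedding paired with row i of traj_emb (the trajectory embedding).
    text_emb = l2_normalize(text_emb)
    traj_emb = l2_normalize(traj_emb)
    logits = text_emb @ traj_emb.T / temperature  # shape (batch, batch)
    idx = np.arange(logits.shape[0])
    # Cross-entropy with the matching pair (the diagonal) as the target,
    # computed in both directions and averaged.
    loss_text_to_traj = -log_softmax(logits)[idx, idx].mean()
    loss_traj_to_text = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_text_to_traj + loss_traj_to_text)

# Toy check with random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned = symmetric_contrastive_loss(emb, emb)
mismatched = symmetric_contrastive_loss(emb, np.roll(emb, 1, axis=0))
```

In this toy check, the loss for correctly paired embeddings (`aligned`) is lower than for deliberately mispaired ones (`mismatched`). An embedding trained with an objective of this kind places matching captions and trajectories close together, which is the property that makes it usable for evaluation metrics.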
