Learning Camera Movement Control from Real-World Drone Videos

Yunzhong Hou, Liang Zheng, Philip Torr

TL;DR

This work automates drone cinematography by predicting camera movements for filming existing subjects rather than generating pixels. It introduces DroneMotion-99K, a large-scale real-world dataset of 3D camera trajectories extracted from online videos, and DVGFormer, an autoregressive transformer that uses long-horizon inputs to predict next-frame camera motion at a 3–15 Hz cadence. The approach outperforms an RT-1-based baseline on 184 rendered sequences in user preference, collision rate, and motion smoothness, demonstrating effective long-horizon planning and scene-adaptive dynamics. The work enables scalable, data-driven AI cinematography with practical implications for automated videography and drone-based content creation.

Abstract

This study seeks to automate camera movement control for filming existing subjects into attractive videos, in contrast to creating non-existent content by directly generating pixels. We select drone videos as our test case due to their rich and challenging motion patterns, distinctive viewing angles, and precise controls. Existing AI videography methods struggle with limited appearance diversity in simulation training, high costs of recording expert operations, and difficulties in designing heuristic-based goals to cover all scenarios. To avoid these issues, we propose a scalable method that involves collecting real-world training data to improve diversity, extracting camera trajectories automatically to minimize annotation costs, and training an effective architecture that does not rely on heuristics. Specifically, we collect 99k high-quality trajectories by running 3D reconstruction on online videos, connecting camera poses from consecutive frames to form 3D camera paths, and using a Kalman filter to identify and remove low-quality data. Moreover, we introduce DVGFormer, an auto-regressive transformer that leverages the camera path and images from all past frames to predict camera movement in the next frame. We evaluate our system across 38 synthetic natural scenes and 7 real city 3D scans. We show that our system effectively learns to perform challenging camera movements such as navigating through obstacles, maintaining low altitude to increase perceived speed, and orbiting towers and buildings, which are very useful for recording high-quality videos. Data and code are available at dvgformer.github.io.
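The trajectory-filtering step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain linear constant-velocity Kalman filter (the paper cites an unscented Kalman filter reference), and the function names, noise variances, and threshold value are all hypothetical. The idea is to filter the per-frame camera locations and reject a clip when any raw location lies too far from the filtered path.

```python
import numpy as np

def kalman_filter_positions(positions, dt=1.0, process_var=1e-2, meas_var=1e-1):
    """Forward constant-velocity Kalman filter over a (T, 3) array of camera locations.

    Assumption: linear filter with hand-picked noise levels, used here only
    as a stand-in for the filter described in the paper.
    """
    positions = np.asarray(positions, dtype=float)
    F = np.eye(6)                      # state: [x, y, z, vx, vy, vz]
    F[:3, 3:] = dt * np.eye(3)         # position += velocity * dt
    H = np.hstack([np.eye(3), np.zeros((3, 3))])  # only position is observed
    Q = process_var * np.eye(6)
    R = meas_var * np.eye(3)

    x = np.concatenate([positions[0], np.zeros(3)])
    P = np.eye(6)
    filtered = np.empty_like(positions)
    filtered[0] = positions[0]
    for t in range(1, positions.shape[0]):
        # Predict the next state from the motion model.
        x = F @ x
        P = F @ P @ F.T + Q
        # Correct the prediction with the observed camera location.
        innovation = positions[t] - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ innovation
        P = (np.eye(6) - K @ H) @ P
        filtered[t] = x[:3]
    return filtered

def max_deviation(positions, **kwargs):
    """Largest distance between a raw camera location and the filtered path."""
    raw = np.asarray(positions, dtype=float)
    filtered = kalman_filter_positions(raw, **kwargs)
    return float(np.linalg.norm(raw - filtered, axis=1).max())

def is_low_quality(positions, threshold, **kwargs):
    """Flag a trajectory whose deviation exceeds a (hypothetical) threshold."""
    return max_deviation(positions, **kwargs) > threshold

# Usage: a smooth straight path versus the same path with one corrupted pose.
smooth = np.stack([np.arange(20.0), np.zeros(20), np.zeros(20)], axis=1)
jumpy = smooth.copy()
jumpy[10, 1] += 30.0                   # simulated reconstruction failure
print(is_low_quality(smooth, threshold=1.0), is_low_quality(jumpy, threshold=1.0))
```

A single outlier pose produces a large innovation that the filter only partly follows, so the raw location ends up far from the filtered path, while a smooth trajectory stays close to it.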

Paper Structure

This paper contains 21 sections, 1 equation, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Examples of recorded videos from our AI cameraman. Instead of generating non-existent content directly in the pixel space, our system outputs camera movements to film existing subjects into aesthetically pleasing videos.
  • Figure 2: Data collection pipeline. Top left: For scraped YouTube videos, we run shot change detection (PySceneDetect) to split the videos into clips of individual scenes. Top right: We then use COLMAP [schoenberger2016sfm] to reconstruct the 3D scene and recover camera poses from video frames. Bottom: Finally, we connect camera poses from consecutive frames to form 3D camera trajectories and apply a Kalman filter [wan2000unscented] to discard low-quality reconstructions whose camera poses in neighboring frames are drastically different.
  • Figure 3: Threshold selection for identifying low-quality 3D reconstructions with unreasonable camera movements between consecutive frames. Left: We label the correctness of $\sim$1k COLMAP reconstructions via our interactive 3D annotation tool by reviewing the reconstruction result and the original video clip side by side. Right: We gather statistics (ROC curve, precision, and recall) on the distance of camera locations to the smoothed camera path from the Kalman filter, and select a threshold (red star) that best separates correct and incorrect reconstructions.
  • Figure 4: Model overview of DVGFormer. To predict camera motion $\bm{a}_t$ for time step $t$, the auto-regressive architecture uses as input a long horizon with camera poses $\left\{\bm{c}_0, \dots, \bm{c}_t\right\}$, motions $\left\{\bm{a}_0, \dots, \bm{a}_{t-1}\right\}$, images $\left\{\bm{x}_0, \dots, \bm{x}_t\right\}$, and their monocular depth estimations from all previous frames. Each action $\bm{a}_t$ is broken into $N$ intermediate steps $\left\{\bm{a}_t^0, \dots, \bm{a}_t^{N-1}\right\}$ between time steps $t$ and $t+1$.
  • Figure 5: Visualization of the recorded videos. DVGFormer learns techniques like keeping the actor in frame, navigating through obstacles, maintaining low altitude to increase perceived speed, orbiting towers and buildings, or increasing altitude and pitching the camera down for a full view, all directly from the DroneMotion-99K dataset and without any heuristics.
  • ...and 5 more figures
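
The threshold selection described in the Figure 3 caption can be sketched as follows. This is a hedged illustration, not the paper's procedure: the caption only says the threshold "best separates" correct and incorrect reconstructions, so the criterion used here (maximizing true-positive rate minus false-positive rate, i.e. Youden's J) is an assumption, and the function names and toy data are hypothetical.

```python
import numpy as np

def roc_points(scores, is_bad):
    """True/false-positive rates when flagging every clip with score >= threshold.

    scores: per-clip deviation from the smoothed camera path.
    is_bad: manual labels, True for an incorrect reconstruction.
    """
    scores = np.asarray(scores, dtype=float)
    is_bad = np.asarray(is_bad, dtype=bool)
    thresholds = np.unique(scores)     # sorted candidate thresholds
    n_bad = max(int(is_bad.sum()), 1)
    n_good = max(int((~is_bad).sum()), 1)
    tpr, fpr = [], []
    for th in thresholds:
        flagged = scores >= th
        tpr.append((flagged & is_bad).sum() / n_bad)
        fpr.append((flagged & ~is_bad).sum() / n_good)
    return thresholds, np.array(tpr), np.array(fpr)

def pick_threshold(scores, is_bad):
    """Assumed criterion: the threshold maximizing TPR - FPR (Youden's J)."""
    thresholds, tpr, fpr = roc_points(scores, is_bad)
    best = int(np.argmax(tpr - fpr))
    return float(thresholds[best]), float(tpr[best]), float(fpr[best])

# Usage with toy labels: four correct clips, three incorrect ones.
toy_scores = [0.1, 0.2, 0.3, 0.4, 2.0, 3.0, 5.0]
toy_is_bad = [False, False, False, False, True, True, True]
threshold, tpr, fpr = pick_threshold(toy_scores, toy_is_bad)
print(threshold, tpr, fpr)
```

On real annotations the two classes overlap, so the chosen threshold trades missed bad reconstructions against discarded good ones; a different criterion (for example a fixed precision target) would shift that trade-off.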