Towards Understanding Camera Motions in Any Video

Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan

TL;DR

This work tackles understanding camera motion in unconstrained videos by building CameraBench, a large, expert-curated dataset with a formal taxonomy of motion primitives and a robust label-then-caption annotation protocol. It reveals significant gaps in prior datasets due to ambiguous definitions and lack of quality control, and it demonstrates that expert-guided training substantially improves annotation reliability. Through extensive benchmarking, the authors show complementary strengths and weaknesses of SfM/SLAM approaches and vision-language models, and they achieve practical gains by fine-tuning a generative VLM to capture both semantic and geometric aspects of motion. The resulting resources enable motion-aware tasks like captioning, VQA, and retrieval, and lay groundwork for future integration of geometric and semantic understanding in video analysis and generation.

Abstract

We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
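The zoom-in versus forward-translation confusion mentioned above can be illustrated with a pinhole camera model. The sketch below is not from the paper; the function names, the two sample points, and the focal lengths are all assumptions chosen for illustration. It shows that a change of intrinsics (focal length) magnifies every point by the same factor, while a change of extrinsics (moving the camera forward) magnifies near points more than far points, which produces parallax.

```python
# Illustrative pinhole-camera sketch (hypothetical names and values, not the paper's code).

def project(point, focal, cam_z=0.0):
    """Project a 3D point (x, y, z) for a camera at (0, 0, cam_z) looking along +z."""
    x, y, z = point
    depth = z - cam_z
    if depth <= 0:
        raise ValueError("point is behind the camera")
    return (focal * x / depth, focal * y / depth)

# Two scene points at the same lateral offset but different depths (assumed values).
NEAR = (1.0, 0.0, 2.0)
FAR = (1.0, 0.0, 10.0)

def magnification(focal, cam_z):
    """Return how much the image x-coordinate of NEAR and FAR grows
    relative to a reference camera with focal=1 at cam_z=0."""
    ref_near = project(NEAR, 1.0)[0]
    ref_far = project(FAR, 1.0)[0]
    new_near = project(NEAR, focal, cam_z)[0]
    new_far = project(FAR, focal, cam_z)[0]
    return (new_near / ref_near, new_far / ref_far)

# Zoom-in: double the focal length, camera stays put (intrinsics change).
zoom = magnification(focal=2.0, cam_z=0.0)

# Dolly-in: keep the focal length, move the camera 1 unit forward (extrinsics change).
dolly = magnification(focal=1.0, cam_z=1.0)

print("zoom  (near, far):", zoom)
print("dolly (near, far):", dolly)
```

Under these assumptions, zooming scales the near and far points identically, whereas the forward translation enlarges the near point noticeably more than the far one. That depth-dependent difference is the visual cue a trained annotator can use to tell the two motions apart.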

Paper Structure

This paper contains 13 sections, 18 figures, 26 tables.

Figures (18)

  • Figure 1: Examples of camera movements. We show videos with their camera trajectories: a tracking shot of a toddler (row 1, left), Hitchcock’s dolly zoom effect (row 2, left), Spielberg’s dramatic pan and tilt in Jurassic Park (row 3, left), Nolan’s roll shot in Inception (row 1, right), a pedestal-up shot from The Legend of Zelda (row 2, right), and a selfie by an amateur photographer, arcing to showcase the scenery while centering themselves (row 3, right). Please watch the videos at https://linzhiqiu.github.io/papers/camerabench.
  • Figure 2: Issues in previous camera motion datasets and our solutions. Existing work contains critical flaws: (1) Inaccurate specification, e.g., MovieNet [movienet, movieshot] conflates translation with rotation or zoom. (2) Contradictory annotations, e.g., AVE [argaw2022anatomy] labels over 1,000 clips as both static (locked) and moving (including pan and tilt). (3) No quality control: even recent VLM benchmarks [tarsier, tang2024vidcomposition, chai2024auroracap] contain major mistakes such as flipping motion direction. See the paper's section on prior-work errors for analysis. The dataset section shows how we address these flaws by working with professionals to design (1) a taxonomy via iterative refinement, (2) a reliable annotation framework for complex motion, and (3) a training program with expert oversight to improve data quality.
  • Figure 3: Taxonomy of camera motion primitives. Our taxonomy, developed in collaboration with cinematographers and vision researchers, is the first to comprehensively capture camera motion across object-, ground-, and camera-centric reference frames, using precise cinematography terms [deguzman2020types] to eliminate ambiguity. It covers camera steadiness, translation, rotation, intrinsic changes, and common object-centric movements, all detailed in this paper. We refine the taxonomy iteratively over three months by annotating real-world videos and incorporating feedback from researchers and cinematographers to ensure both accuracy and completeness.
  • Figure 4: Human study and training program. We hire $\sim$100 participants from diverse backgrounds, including non-experts with limited knowledge of camera movements and experts from the filmmaking industry with hands-on cinematography experience. Panel (a) shows the average accuracy of both groups in selecting motion primitives on 30 videos, where experts clearly outperform non-experts. In addition, the roughly 80% of participants who review our multimodal guidelines (including textual definitions, video examples, and edge cases) significantly outperform the remaining 20% who only see textual definitions. Panel (b) shows that extended practice with detailed error feedback boosts accuracy for all participants. We hire only those who complete all five rounds (with 30 videos each) to annotate our dataset.
  • Figure 5: Example annotations. Our videos (left) are annotated with binary labels for $\sim$50 camera motion primitives from our taxonomy, along with language descriptions capturing key motion aspects. We visualize the caption word cloud on the top-right and a pie chart of video genres on the bottom-right. Note that the "other" genre includes more tags such as dashcam, drone, selfie, ads, mixed media, animals, art, sports, lectures, and screen recordings. See https://linzhiqiu.github.io/papers/camerabench/ for videos.
  • ...and 13 more figures
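
The captions for Figures 2 and 5 describe per-video binary labels paired with a caption, and a failure mode in prior datasets where a clip is labeled both static and moving. The sketch below shows one way such a record and a consistency check might look. It is only an illustration: the `ClipAnnotation` class, the label names, and the sample clips are hypothetical and are not CameraBench's actual schema.

```python
# Hypothetical annotation record and consistency check (not the dataset's real schema).
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class ClipAnnotation:
    video_id: str
    labels: Dict[str, bool] = field(default_factory=dict)  # primitive name -> present?
    caption: str = ""

# Assumed label names for primitives that imply the camera is moving.
MOVING_LABELS = {"pan_left", "pan_right", "tilt_up", "tilt_down"}

def is_contradictory(ann: ClipAnnotation) -> bool:
    """Flag a clip marked static that is also marked with any moving primitive."""
    static = ann.labels.get("static", False)
    moving = any(ann.labels.get(name, False) for name in MOVING_LABELS)
    return static and moving

clips = [
    ClipAnnotation("clip_a", {"static": True}, "The camera is locked off."),
    ClipAnnotation("clip_b", {"static": True, "pan_left": True}, "The camera pans left."),
    ClipAnnotation("clip_c", {"tilt_up": True}, "The camera tilts up."),
]

flagged = [c.video_id for c in clips if is_contradictory(c)]
print(flagged)
```

A check of this kind only catches label-level contradictions. It says nothing about whether a direction was flipped, which still requires a reviewer to watch the video.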