Geometry-Guided Camera Motion Understanding in VideoLLMs

Haoan Feng, Sri Harsha Musunuri, Guan-Ming Su

Abstract

Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (VideoLLMs) rarely represent it explicitly and often fail on fine-grained motion primitives. We address this gap with a framework of $\textbf{benchmarking}$, $\textbf{diagnosis}$, and $\textbf{injection}$. We curate $\textbf{CameraMotionDataset}$, a large-scale synthetic dataset with explicit camera control, formulate camera motion as constraint-aware multi-label recognition, and construct a VQA benchmark, $\textbf{CameraMotionVQA}$. Across diverse off-the-shelf VideoLLMs, we observe substantial errors in recognizing camera motion primitives. Probing experiments on a Qwen2.5-VL vision encoder suggest that camera motion cues are weakly represented, especially in deeper ViT blocks, helping explain the observed failure modes. To bridge this gap without costly training or fine-tuning, we propose a lightweight, model-agnostic pipeline that extracts geometric camera cues from 3D foundation models (3DFMs), predicts constrained motion primitives with a temporal classifier, and injects them into downstream VideoLLM inference via structured prompting. Experiments demonstrate improved motion recognition and more camera-aware model responses, highlighting geometry-driven cue extraction and structured prompting as practical steps toward camera-aware VideoLLM and VLA systems. The dataset and benchmark are publicly available at https://hf.co/datasets/fengyee/camera-motion-dataset-and-benchmark.
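To make "constraint-aware multi-label recognition" concrete, the following is a minimal sketch of one way such decoding could work: threshold per-primitive probabilities, then resolve mutually exclusive pairs by keeping the higher-scoring label. The primitive names, the incompatibility pairs, the 0.5 threshold, and the `static` fallback are all assumptions made for illustration; the abstract does not specify the label set or the decoding rule.

```python
from typing import Dict, List, Tuple

# Hypothetical incompatibility pairs, for illustration only; the paper's
# actual primitive names and constraints may differ.
INCOMPATIBLE: List[Tuple[str, str]] = [
    ("pan_left", "pan_right"),
    ("tilt_up", "tilt_down"),
    ("dolly_in", "dolly_out"),
]


def decode_primitives(probs: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Threshold per-primitive probabilities, then enforce pairwise exclusivity."""
    # Multi-label step: every primitive at or above the threshold is a candidate.
    active = {name for name, p in probs.items() if p >= threshold}
    for a, b in INCOMPATIBLE:
        if a in active and b in active:
            # Both members of an exclusive pair fired: drop the weaker one
            # (on a tie, the second member of the pair is dropped).
            active.discard(a if probs[a] < probs[b] else b)
    # Assumed fallback: no active primitive is reported as a static camera.
    return sorted(active) if active else ["static"]


if __name__ == "__main__":
    print(decode_primitives({"pan_left": 0.9, "pan_right": 0.6, "tilt_up": 0.7}))
```

Compatible primitives (here a pan and a tilt) can still co-occur, which is what distinguishes this from single-label classification.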

Paper Structure (21 sections, 3 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Overall pipeline. Camera cues are extracted from a frozen 3DFM (VGGT) and passed to a Transformer-based temporal classifier to predict camera-motion primitives, and per-second motions are injected as a structured prompt field for VideoLLMs without modifying VideoLLM weights. For clarity, the probing and distillation pipelines are shown separately in Fig. 3 and Fig. 4.
  • Figure 2: Flow chart for dataset/benchmark construction. From MultiCamVideo [bai2025recammaster], video clips and camera extrinsics are preprocessed (split, resized, and normalized) and labeled with several motion constraints (e.g., incompatibility). A subset of shot segments is sampled based on motion primitives to balance classes, before being stored as CameraMotionDataset and formulated into CameraMotionVQA records.
  • Figure 3: Probing experiment schematic. Query tokens $Q_t$ gather camera motion-related information from the projected intermediate visual features of the frozen vision encoder. Camera motion logits are predicted from the temporal convolution output of the transferred vision tokens $Q_t'$.
  • Figure 4: VGGT--Q-Former schematic. Camera tokens and visual tokens are both bottlenecked by a projection layer. Query tokens gather camera motion-related information via interleaved local-/global-attention blocks and regress the projected camera tokens to distill the camera perception capability of VGGT. Branches annotated with Stage $i$ indicate the different training objectives and token flows.
  • Figure 5: Off-the-shelf VideoLLM performance on CameraMotionVQA. Horizontal bars report overall multiple-choice accuracy. All models, labeled by their Hugging Face model names, use an identical frame input and VQA prompt template.
  • ...and 2 more figures
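
The injection step described in Figure 1 (per-second motions passed to the VideoLLM as a structured prompt field) can be sketched as below. This is an illustrative assumption rather than the paper's actual template: the `<camera_motion>` tag, the time-range format, and the function name `build_motion_prompt` are hypothetical.

```python
from typing import List


def build_motion_prompt(per_second: List[List[str]], question: str) -> str:
    """Prepend a per-second camera-motion field to a VideoLLM question.

    `per_second[t]` holds the predicted primitives for second t. The tag
    name and line layout are hypothetical, not the paper's template.
    """
    lines = []
    for t, labels in enumerate(per_second):
        # An empty prediction for a second is rendered as a static camera.
        motion = ", ".join(labels) if labels else "static"
        lines.append(f"[{t}s-{t + 1}s] {motion}")
    field = "\n".join(lines)
    return f"<camera_motion>\n{field}\n</camera_motion>\n{question}"


if __name__ == "__main__":
    prompt = build_motion_prompt(
        [["pan_left"], ["pan_left", "tilt_up"], []],
        "Describe how the camera moves in this clip.",
    )
    print(prompt)
```

Because the cues travel only through the text prompt, this kind of injection leaves the VideoLLM weights untouched, consistent with the training-free design stated in the abstract.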