VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation

Mengtian Li, Yuwei Lu, Feifei Li, Chenqi Gan, Zhifeng Xie, Xi Wang

Abstract

Cinematic camera control relies on a tight feedback loop between director and cinematographer, where camera motion and framing are continuously reviewed and refined. Recent generative camera systems can produce diverse, text-conditioned trajectories, but they lack this "director in the loop" and have no explicit supervision of whether a shot is visually desirable. The result is camera motion that stays in distribution yet exhibits poor framing, off-screen characters, and undesirable visual aesthetics. In this paper, we introduce VERTIGO, the first framework for visual preference optimization of camera trajectory generators. Our framework leverages a real-time graphics engine (Unity) to render 2D visual previews from generated camera motion. A cinematically fine-tuned vision-language model then scores these previews using our proposed cyclic semantic similarity mechanism, which aligns renders with text prompts. This process provides the visual preference signals for Direct Preference Optimization (DPO) post-training. Both quantitative evaluations and user studies on Unity renders and diffusion-based Camera-to-Video pipelines show consistent gains in condition adherence, framing quality, and perceptual realism. Notably, VERTIGO reduces the character off-screen rate from 38% to nearly 0% while preserving the geometric fidelity of camera motion. User study participants further prefer VERTIGO over baselines across composition, consistency, prompt adherence, and aesthetic quality, confirming the perceptual benefits of our visual preference post-training.
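The cyclic scoring loop described in the abstract — caption the rendered preview with a VLM, embed both the original prompt and the caption, and use their latent-space similarity as a preference score for DPO pair construction — can be sketched as follows. This is a minimal illustration, not the paper's implementation: `embed` is a toy bag-of-words stand-in for the real text encoder, and `dpo_pair` is a hypothetical helper for turning scored candidates into a chosen/rejected pair.

```python
# Hedged sketch of cyclic semantic scoring: compare the camera prompt with the
# VLM's caption of the rendered preview in a shared latent space, then rank
# candidate trajectories to build a DPO preference pair.
# `embed` is a toy stand-in for the actual text encoder used in the paper.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: lowercase bag-of-words counts (placeholder for a real encoder).
    return Counter(text.lower().split())

def cyclic_score(prompt: str, caption: str) -> float:
    # Cosine similarity between prompt and caption embeddings.
    p, c = embed(prompt), embed(caption)
    dot = sum(p[w] * c[w] for w in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in c.values())))
    return dot / norm if norm else 0.0

def dpo_pair(prompt: str, captions: list[str]) -> tuple[str, str]:
    # Rank candidate trajectories by score; the best-aligned caption's trajectory
    # becomes the "chosen" sample, the worst the "rejected" one.
    ranked = sorted(captions, key=lambda cap: cyclic_score(prompt, cap), reverse=True)
    return ranked[0], ranked[-1]
```

In the actual framework the embeddings would come from the cinematically fine-tuned VLM and the pairs would feed a standard DPO objective; this sketch only shows the scoring and pairing logic.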

Paper Structure

This paper contains 25 sections, 5 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Overview of VERTIGO. We present an integrated framework that converts film scripts into 3D camera trajectories refined via preference-based post-training. On the right, we show GenDoP results before and after post-training, demonstrating improved framing and quality in both graphics engine rendering and video generation.
  • Figure 2: Pipeline of VERTIGO. From a camera prompt, the generator produces 3D trajectories rendered into preview sequences by a graphics engine. A VLM performs inverse reasoning to caption the realized motion; the original prompt and generated caption are compared in latent space to derive preference scores for DPO post-training.
  • Figure 3: Different VLM scoring and fine-tuning strategies of VERTIGO. We explore three preference scoring methods: (a) cyclic semantic scoring via latent-space similarity; (b) tag-consistency scoring; (c) direct scalar regression via RAFT-style fine-tuning on interpolated trajectories.
  • Figure 4: Qualitative comparison of camera generators. VERTIGO accurately adheres to spatial composition instructions and maintains framing, whereas GenDoP and DIRECTOR misplace subjects and occasionally lose them.
  • Figure 5: Qualitative comparison of video-to-video transfer results. We compare VACE-based video transfer on two trajectories. For the first trajectory, we show GenDoP vs. VERTIGO transfer results; for the second, we compare static and animated character renderings transferred under the same trajectory. VERTIGO produces more robust framing after transfer, while character animation does not affect transfer quality, as trajectories are defined in the subject's local coordinate space.
  • ...and 5 more figures