Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation

Agneet Chatterjee, Rahim Entezari, Maksym Zhuravinskyi, Maksim Lapin, Reshinth Adithyan, Amit Raj, Chitta Baral, Yezhou Yang, Varun Jampani

TL;DR

SCINE (Stable Cinemetrics) introduces a structured, four-pillar cinematic taxonomy (Setup, Events, Lighting, Camera) comprising 76 leaf controls for benchmarking professional video generation. It pairs this taxonomy with SCINE-Scripts and SCINE-Visuals prompts and an automatic, node-level question generation pipeline, enabling fine-grained evaluation. A large-scale human study across 10+ models and 20K videos, annotated by 80+ film professionals, reveals persistent gaps, especially in Events and Camera controls. A trained vision-language model achieves 72.36% alignment with expert judgments, which makes scalable assessment possible. The work provides a principled framework for diagnosing model capabilities and guiding future improvements toward production-ready professional video generation.

Abstract

Recent advances in video generation have enabled high-fidelity video synthesis from user-provided prompts. However, existing models and benchmarks fail to capture the complexity and requirements of professional video generation. Towards that goal, we introduce Stable Cinemetrics, a structured evaluation framework that formalizes filmmaking controls into four disentangled, hierarchical taxonomies: Setup, Events, Lighting, and Camera. Together, these taxonomies define 76 fine-grained control nodes grounded in industry practices. Using these taxonomies, we construct a benchmark of prompts aligned with professional use cases and develop an automated pipeline for prompt categorization and question generation, enabling independent evaluation of each control dimension. We conduct a large-scale human study spanning 10+ models and 20K videos, annotated by a pool of 80+ film professionals. Our analyses, both coarse and fine-grained, reveal that even the strongest current models exhibit significant gaps, particularly in Events and Camera-related controls. To enable scalable evaluation, we train an automatic evaluator, a vision-language model aligned with expert annotations that outperforms existing zero-shot baselines. SCINE is the first approach to situate professional video generation within the landscape of video generative models, introducing taxonomies centered around cinematic controls and supporting them with structured evaluation pipelines and detailed analyses to guide future research.
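
The idea of mapping each control element in a prompt back to a taxonomy node, then generating one isolated question per node, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Node` class, the `make_questions` helper, the question template, and the leaf names (a small subset loosely based on the figure captions, not the paper's 76 nodes). It is not the authors' pipeline, which categorizes prompts automatically.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a hierarchical taxonomy (hypothetical structure)."""
    name: str
    children: list["Node"] = field(default_factory=list)

    def leaves(self, prefix=()):
        """Yield the full path (tuple of names) of every leaf under this node."""
        path = prefix + (self.name,)
        if not self.children:
            yield path
        for child in self.children:
            yield from child.leaves(path)


# Illustrative subset only; the real taxonomies define 76 leaf controls.
TAXONOMY = [
    Node("Setup", [Node("Subjects"), Node("Props"), Node("Environment")]),
    Node("Events", [Node("Actions"), Node("Emotions")]),
    Node("Lighting", [Node("Light Source"), Node("Light Properties")]),
    Node("Camera", [Node("Camera Configuration")]),
]


def leaf_paths(taxonomy):
    """Collect every leaf path across all four pillars."""
    return [path for root in taxonomy for path in root.leaves()]


def make_questions(tagged_elements):
    """Turn {leaf path: prompt phrase} into one question per node.

    The tagging itself is assumed to have been done upstream.
    """
    valid = set(leaf_paths(TAXONOMY))
    questions = []
    for path, phrase in tagged_elements.items():
        if path not in valid:
            raise KeyError(f"unknown taxonomy node: {path}")
        questions.append({
            "node": " > ".join(path),
            "question": f"Does the video correctly depict: {phrase}?",
        })
    return questions


if __name__ == "__main__":
    tagged = {
        ("Camera", "Camera Configuration"): "a slow dolly-in",
        ("Events", "Emotions"): "the subject's growing unease",
    }
    for q in make_questions(tagged):
        print(q["node"], "|", q["question"])
```

Keeping one question per leaf node is what allows each control dimension to be scored independently: a video can pass the Lighting questions while failing the Camera ones.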

Paper Structure

This paper contains 26 sections, 31 figures, 13 tables.

Figures (31)

  • Figure 1: Stable Cinemetrics introduces structured taxonomies grounded in the controls required for professional video generation. These taxonomies form the foundation of our prompt-based benchmark that mirrors real-world shot creation, progressing from scriptwriting to on-screen visuals. Every control element in a prompt is automatically categorized back to the taxonomy, enabling the generation of isolated evaluation questions for independent investigation into each element. This supports large-scale human evaluation, enabling both coarse and fine-grained insights into the capabilities of current models for professional video generation. To drive scalable annotations, we develop our own VLMs that outperform existing models in alignment with human judgements.
  • Figure 2: The Setup taxonomy outlines the visual components within the frame, including subjects, props, and environmental context.
  • Figure 3: The Camera taxonomy defines all controls related to camera configuration during a shot setup.
  • Figure 4: The Lighting taxonomy specifies the illumination of a shot through light sources, their properties, and their interaction with the scene.
  • Figure 6: The Events taxonomy captures the narrative dimension of a shot, which includes actions, emotions, and their fine-grained portrayal as they evolve over time within a shot.
  • ...and 26 more figures