CineTechBench: A Benchmark for Cinematographic Technique Understanding and Generation

Xinran Wang, Songyu Xu, Xiangxuan Shan, Yuxuan Zhang, Muxi Diao, Xueyan Duan, Yanhua Huang, Kongming Liang, Zhanyu Ma

TL;DR

CineTechBench is an expert-annotated benchmark for evaluating both the understanding and the generation of cinematographic techniques across seven core dimensions. It pairs a formal taxonomy with a dataset of 600+ images and 120 clips, yielding 610 image QA pairs, 128 video QA pairs, and corresponding descriptions, which are used to evaluate 15+ multimodal LLMs and 5+ video generation models. The evaluations show that current models struggle with fine-grained cinematography, particularly camera movement dynamics, lighting direction, and rotation orientation, and that a substantial gap remains between recognizing a technique and describing or synthesizing it fluently. The benchmark and accompanying code provide targeted, domain-specific evaluation and insight into model limitations, with the aim of advancing automated film analysis and cinema-quality motion synthesis.

Abstract

Cinematography is a cornerstone of film production and appreciation, shaping mood, emotion, and narrative through visual elements such as camera movement, shot composition, and lighting. Despite recent progress in multimodal large language models (MLLMs) and video generation models, the capacity of current models to grasp and reproduce cinematographic techniques remains largely uncharted, hindered by the scarcity of expert-annotated data. To bridge this gap, we present CineTechBench, a pioneering benchmark founded on precise, manual annotation by seasoned cinematography experts across key cinematography dimensions. Our benchmark covers seven essential aspects (shot scale, shot angle, composition, camera movement, lighting, color, and focal length) and includes over 600 annotated movie images and 120 movie clips with clear cinematographic techniques. For the understanding task, we design question-answer pairs and annotated descriptions to assess MLLMs' ability to interpret and explain cinematographic techniques. For the generation task, we assess advanced video generation models on their capacity to reconstruct cinema-quality camera movements given conditions such as textual prompts or keyframes. We conduct a large-scale evaluation on 15+ MLLMs and 5+ video generation models. Our results offer insights into the limitations of current models and future directions for cinematography understanding and generation in automatic film production and appreciation. The code and benchmark can be accessed at https://github.com/PRIS-CV/CineTechBench.
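
As a rough illustration of how question-answer results on the understanding task could be broken down by the seven dimensions, the sketch below computes per-dimension accuracy. It is not taken from the CineTechBench code: the record keys (`dimension`, `answer`, `prediction`) and the exact-match comparison of option letters are assumptions made for this example.

```python
from collections import defaultdict

# The seven dimensions named in the abstract.
DIMENSIONS = (
    "shot scale", "shot angle", "composition", "camera movement",
    "lighting", "color", "focal length",
)

def accuracy_by_dimension(records):
    """Return {dimension: accuracy} for a list of QA records.

    Each record is a dict with hypothetical keys:
      'dimension'  -- one of DIMENSIONS
      'answer'     -- ground-truth option letter, e.g. "A"
      'prediction' -- option letter produced by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        dim = record["dimension"]
        if dim not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dim!r}")
        total[dim] += 1
        # Assumed scoring rule: case-insensitive exact match on the option letter.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[dim] += 1
    return {dim: correct[dim] / total[dim] for dim in total}
```

Reporting accuracy per dimension rather than as a single number makes it easier to see which techniques, such as camera movement or lighting, a model handles poorly.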

Paper Structure

This paper contains 38 sections, 3 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: Cinematography taxonomy and data examples in our CineTechBench.
  • Figure 2: Our benchmark focuses on the cinematographic techniques used in film production and appreciation. Compared with similar benchmarks, it covers more core dimensions of cinematography.
  • Figure 3: Overview of our benchmark building process.
  • Figure 4: Visualization of MLLMs' answers on the cinematographic technique question answering task. Red text highlights wrong answers and green text highlights correct answers. More visualization examples can be seen in the appendix.
  • Figure 5: Movie clips generated by different video generation models and the corresponding camera trajectories estimated by MonST3R (Zhang et al., ICLR 2025). More examples are shown in the appendix.
  • ...and 10 more figures