RefineShot: Rethinking Cinematography Understanding with Foundational Skill Evaluation

Hang Wu, Yujun Cai, Haonan Ge, Hongkai Chen, Ming-Hsuan Yang, Yiwei Wang

TL;DR

This work identifies reliability and fairness gaps in existing cinematography-understanding benchmarks and baselines. It refines ShotBench into RefineShot by standardizing option granularity and mutual exclusivity, and introduces an expanded evaluation protocol that couples task accuracy with core reasoning competencies. Through analyses of ShotVL and Qwen models, the study reveals that high accuracy can coincide with weak reasoning faithfulness and poor instruction adherence, while robust reasoning-capable models like Qwen maintain reliability under structured prompts. The proposed framework offers a more principled basis for evaluating cinematography understanding and guiding future improvements in multimodal models for narrative-driven visual reasoning.

Abstract

Cinematography understanding refers to the ability to recognize not only the visual content of a scene but also the cinematic techniques that shape narrative meaning. This capability is attracting increasing attention, as it enhances multimodal understanding in real-world applications and underpins coherent content creation in film and media. As the most comprehensive benchmark for this task, ShotBench spans a wide range of cinematic concepts and VQA-style evaluations, with ShotVL achieving state-of-the-art results on it. However, our analysis reveals that ambiguous option design in ShotBench and ShotVL's shortcomings in reasoning consistency and instruction adherence undermine evaluation reliability, limiting fair comparison and hindering future progress. To overcome these issues, we systematically refine ShotBench through consistent option restructuring, conduct the first critical analysis of ShotVL's reasoning behavior, and introduce an extended evaluation protocol that jointly assesses task accuracy and core model competencies. These efforts lead to RefineShot, a refined and expanded benchmark that enables more reliable assessment and fosters future advances in cinematography understanding.

Paper Structure

This paper contains 17 sections, 7 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of our work. We first analyze and refine the options in ShotBench to address their inconsistencies, then examine state-of-the-art models and reveal their reliability defects. Based on these findings, we propose a new evaluation protocol and demonstrate its effectiveness through comprehensive experiments.
  • Figure 2: Refining dataset options by introducing a finer-grained taxonomy and replacing inconsistent choices in ShotBench. This ensures that options within each question are mutually exclusive and of consistent granularity.
  • Figure 3: Refinement Case. This figure shows how inconsistent lighting type labels are improved for a benchmark dataset. We first map the ground-truth option to its corresponding refined category, remove options from mismatched categories, replace them with alternatives from the same category, and finally randomize the order to ensure fairness.
  • Figure 4: Model Analysis. This figure shows two main defects of ShotVL models: reasoning unfaithfulness, with frequent mismatches between reasoning and answers, and poor instruction adherence, where prompts are ignored in favor of long repetitive outputs.
  • Figure 5: Instruction adherence case. This case shows the instruction adherence of different models. When given a demonstration-based prompt, ShotVL fails to follow the instructions and produces disorganized reasoning, whereas Qwen accurately follows the format, outputting each step and the final answer as required.
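
The refinement steps described in the Figure 3 caption (map the ground-truth option to its refined category, drop options from other categories, fill in with same-category alternatives, then shuffle) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the `TAXONOMY` contents, the label names, and the helper names `category_of` and `refine_options` are all hypothetical and were chosen for this example.

```python
import random

# Hypothetical taxonomy (refined category -> labels). The categories and
# labels are illustrative assumptions, not taken from the paper.
TAXONOMY = {
    "light_quality": ["Hard light", "Soft light"],
    "light_source": ["Daylight", "Firelight", "Practical light", "Moonlight"],
    "light_direction": ["Backlight", "Side light", "Top light", "Underlight"],
}


def category_of(label):
    """Return the refined category that contains `label`."""
    for category, labels in TAXONOMY.items():
        if label in labels:
            return category
    raise KeyError(f"label not in taxonomy: {label!r}")


def refine_options(options, answer, rng):
    """Rebuild an option list so every option shares the answer's category."""
    same_category = TAXONOMY[category_of(answer)]
    # Steps 1-2: keep only the options that belong to the answer's category.
    kept = [o for o in options if o in same_category]
    # Step 3: replace the removed options with unused same-category labels.
    # If the category has too few labels, the result is simply shorter.
    pool = [label for label in same_category if label not in kept]
    rng.shuffle(pool)
    refined = kept + pool[: len(options) - len(kept)]
    # Step 4: randomize the order so the answer position carries no signal.
    rng.shuffle(refined)
    return refined


# Usage: options that mix source, quality, and direction labels.
rng = random.Random(0)  # seeded so the sketch is reproducible
original = ["Daylight", "Hard light", "Backlight", "Firelight"]
refined = refine_options(original, answer="Daylight", rng=rng)
print(refined)
```

Passing an explicit `random.Random` instance keeps the shuffle reproducible, which matters if refined option orders need to be regenerated identically across evaluation runs.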