RefineShot: Rethinking Cinematography Understanding with Foundational Skill Evaluation
Hang Wu, Yujun Cai, Haonan Ge, Hongkai Chen, Ming-Hsuan Yang, Yiwei Wang
TL;DR
This work identifies reliability and fairness gaps in existing cinematography-understanding benchmarks and baselines. It refines ShotBench into RefineShot by standardizing option granularity and enforcing mutual exclusivity among answer options, and introduces an expanded evaluation protocol that couples task accuracy with core reasoning competencies. Analyses of ShotVL and Qwen models show that high accuracy can coincide with weak reasoning faithfulness and poor instruction adherence, whereas reasoning-capable models such as Qwen remain reliable under structured prompts. The proposed framework offers a more principled basis for evaluating cinematography understanding and guiding future improvements in multimodal models for narrative-driven visual reasoning.
Abstract
Cinematography understanding refers to the ability to recognize not only the visual content of a scene but also the cinematic techniques that shape narrative meaning. This capability is attracting increasing attention, as it enhances multimodal understanding in real-world applications and underpins coherent content creation in film and media. As the most comprehensive benchmark for this task, ShotBench spans a wide range of cinematic concepts and VQA-style evaluations, with ShotVL achieving state-of-the-art results on it. However, our analysis reveals that ambiguous option design in ShotBench and ShotVL's shortcomings in reasoning consistency and instruction adherence undermine evaluation reliability, limiting fair comparison and hindering future progress. To overcome these issues, we systematically refine ShotBench through consistent option restructuring, conduct the first critical analysis of ShotVL's reasoning behavior, and introduce an extended evaluation protocol that jointly assesses task accuracy and core model competencies. These efforts lead to RefineShot, a refined and expanded benchmark that enables more reliable assessment and fosters future advances in cinematography understanding.
