
Seeking Universal Shot Language Understanding Solutions

Haoxin Liu, Harshavardhan Kamarthi, Zhiyuan Zhao, Hongjie Chen, B. Aditya Prakash

Abstract

Shot language understanding (SLU) is crucial for cinematic analysis but remains challenging due to its diverse cinematographic dimensions and subjective expert judgment. While vision-language models (VLMs) show strong general visual understanding, recent studies reveal judgment discrepancies between VLMs and film experts on SLU tasks. To address this gap, we introduce SLU-SUITE, a comprehensive training and evaluation suite containing 490K human-annotated QA pairs across 33 tasks spanning six film-grounded dimensions. Using SLU-SUITE, we report two new insights into VLM-based SLU: on the model side, we diagnose the key module-level bottlenecks; on the data side, we quantify cross-dimensional influences among tasks. These findings motivate two complementary universal SLU solutions: UniShot, a balanced one-for-all generalist trained via dynamic-balanced data mixing, and AgentShots, a prompt-routed expert cluster that maximizes peak per-dimension performance. Extensive experiments show that our models outperform task-specific ensembles on in-domain tasks and surpass leading commercial VLMs by 22% on out-of-domain tasks.

Paper Structure (104 sections, 5 theorems, 22 equations, 5 figures, 11 tables, 1 algorithm)


Key Result

Proposition 1

Under Assumptions \ref{ass:balance}--\ref{ass:normalizer}, […], where […] is a composite label prior, and the residual satisfies $\mathbb{E}_H[|r_y(H)|]\le c_1\varepsilon+c_2\eta$.

Figures (5)

  • Figure 1: Cross-dimensional transfer matrix. Each column denotes the source training dimension and each row denotes the target test dimension. The NA column corresponds to the initial performance of the VLM backbone (Qwen3-VL-8B). Transfer is broadly useful, but clearly uneven across targets.
  • Figure 2: Ablation of data mixing strategies. Accuracy is normalized by AgentShots_naive; $1.0$ means matching a specialist trained only on in-dimension data. Compared with UniShot_naive, UniShot improves the weakest dimensions and the average performance, producing a more balanced single-model generalist. Compared with AgentShots_naive, AgentShots improves all six dimensions and achieves the best result on 5 of 6. MoLEwumixture serves as an advanced multi-LoRA baseline that tests the performance of end-to-end data mixing. See App. \ref{app:recipe-ablation} for raw results.
  • Figure 3: Results of dataset variants. At a matched 4K budget, a heterogeneous dataset (multi-source, i.e., randomly sampled from our SLU-SUITE) yields a significant improvement over the homogeneous dataset (mainly ShotBench). Further scaling the heterogeneous pool up to 410K yields progressively better performance. See detailed results in App. \ref{tab:hetero-scale-full}.
  • Figure 4: Unified multiple-choice prompt template used for the majority of SLU-SUITE tasks, including composition, coverage, viewpoint, lighting, motion classification, and cut-type recognition. The multimodal placeholder is always placed on the first line.
  • Figure 5: Causal view of subjective supervision in SLU. The source $s$ affects both the visual distribution and the annotation mechanism. We separate semantic evidence $H=\phi(X)$ from source-dependent annotation effects: a taxonomy/readout operator $A_s$ and a preference term $b_s$.
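
The normalization described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the dimension keys (`dim_a`, `dim_b`), and all accuracy values are hypothetical placeholders, and the sketch assumes that normalization is a plain per-dimension ratio against the specialist accuracy.

```python
from typing import Dict


def normalize_by_specialist(
    model_acc: Dict[str, float], specialist_acc: Dict[str, float]
) -> Dict[str, float]:
    """Divide each per-dimension accuracy by the specialist's accuracy.

    A value of 1.0 means the model matches the specialist on that dimension.
    """
    normalized = {}
    for dim, acc in model_acc.items():
        if dim not in specialist_acc:
            raise KeyError(f"no specialist accuracy for dimension {dim!r}")
        base = specialist_acc[dim]
        if base <= 0:
            raise ValueError(f"specialist accuracy for {dim!r} must be positive")
        normalized[dim] = acc / base
    return normalized


def summarize(normalized: Dict[str, float]) -> Dict[str, object]:
    """Report the weakest dimension and the mean normalized accuracy."""
    weakest = min(normalized, key=normalized.get)
    return {
        "weakest_dimension": weakest,
        "weakest": normalized[weakest],
        "average": sum(normalized.values()) / len(normalized),
    }


# Placeholder accuracies for illustration only; these are NOT results from the paper.
specialist = {"dim_a": 0.8, "dim_b": 0.5}
generalist = {"dim_a": 0.4, "dim_b": 0.5}

normalized = normalize_by_specialist(generalist, specialist)
summary = summarize(normalized)
print(normalized)
print(summary)
```

Reporting the weakest dimension next to the average mirrors the caption's two comparison axes: a generalist can raise its average while still trailing the specialist badly on one dimension, and the per-dimension ratio makes that visible.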

Theorems & Definitions (11)

  • Remark 1: Scope of the causal formulation
  • Proposition 1: Dominant source-induced prior
  • proof
  • Corollary 1: Structural coverage recovery
  • proof
  • Corollary 2: Preference dilution
  • proof
  • Corollary 3: Cross-source contrast identifies preference differences
  • proof
  • Proposition 2: Source diversity stabilises the prior
  • ...and 1 more