Can video generation replace cinematographers? Research on the cinematic language of generated video
Xiaozhe Li, Kai WU, Siyi Yang, YiZhan Qu, Guohua Zhang, Zhiyu Chen, Jiayao Li, Jiangchuan Mu, Xiaobin Hu, Wen Fang, Mingliang Xiong, Hao Deng, Qingwen Liu, Gang Li, Bin He
TL;DR
The paper tackles the gap in cinematic-language control within text-to-video diffusion by introducing a threefold framework: Cinematic2K, a ~2,000-video cinematic-language dataset spanning 20 subcategories; CameraDiff, a LoRA-based system enabling precise single-shot cinematic control across 20 types; and CameraCLIP with CLIPLoRA, a dynamic, CLIP-guided mechanism for multi-shot fusion within a single video. CameraCLIP achieves state-of-the-art cinematic alignment with an $R@1$ of $0.83$, while CameraDiff provides stable, fine-grained control and CLIPLoRA enables smooth transitions between multiple cinematic LoRAs. Ablation studies validate the mean-pooling temporal representation and demonstrate the superiority of CameraCLIP as an evaluator for LoRA selection in diffusion. Collectively, the work advances controllable cinematic-language generation in T2V, offering a practical path toward bridging automated video synthesis and professional cinematography.
Abstract
Recent advancements in text-to-video (T2V) generation have leveraged diffusion models to enhance visual coherence in videos synthesized from textual descriptions. However, existing research primarily focuses on object motion, often overlooking cinematic language, which is crucial for conveying emotion and narrative pacing in cinematography. To address this, we propose a threefold approach to improve cinematic control in T2V models. First, we introduce a meticulously annotated cinematic language dataset with twenty subcategories, covering shot framing, shot angles, and camera movements, enabling models to learn diverse cinematic styles. Second, we present CameraDiff, which employs LoRA for precise and stable cinematic control, ensuring flexible shot generation. Third, we propose CameraCLIP, designed to evaluate cinematic alignment and guide multi-shot composition. Building on CameraCLIP, we introduce CLIPLoRA, a CLIP-guided dynamic LoRA composition method that adaptively fuses multiple pre-trained cinematic LoRAs, enabling smooth transitions and seamless style blending. Experimental results demonstrate that CameraDiff ensures stable and precise cinematic control, CameraCLIP achieves an R@1 score of 0.83, and CLIPLoRA significantly enhances multi-shot composition within a single video, bridging the gap between automated video generation and professional cinematography.
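For readers unfamiliar with the R@1 figure quoted above, the following is a minimal sketch of how a recall-at-1 retrieval score can be computed over video and text embeddings, with per-frame embeddings mean-pooled over time as mentioned in the TL;DR. It is an illustration under stated assumptions, not the authors' evaluation code: the function names (`mean_pool`, `cosine`, `recall_at_1`), the cosine similarity choice, and the toy 2-D vectors are all hypothetical, and a real pipeline would obtain embeddings from a CLIP-style encoder.

```python
import math

def mean_pool(frame_embeddings):
    """Average per-frame vectors into a single video-level vector."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[d] for frame in frame_embeddings) / n for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def recall_at_1(video_vectors, text_vectors):
    """Fraction of videos whose most similar text is the paired one (same index)."""
    hits = 0
    for i, video in enumerate(video_vectors):
        sims = [cosine(video, text) for text in text_vectors]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(video_vectors)

# Toy data (made-up numbers): three videos, two frames each, 2-D embeddings.
videos = [
    [[1.0, 0.0], [0.8, 0.2]],  # paired with text 0
    [[0.0, 1.0], [0.2, 0.8]],  # paired with text 1
    [[0.9, 0.1], [1.0, 0.0]],  # paired with text 2, but resembles text 0
]
texts = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

pooled = [mean_pool(frames) for frames in videos]
score = recall_at_1(pooled, texts)
```

In this toy setup the third video is retrieved against the wrong caption, so `score` is 2 of 3. A reported R@1 of 0.83 would correspond, under this reading, to the correct cinematic description ranking first for 83% of queries.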
