Fine-Tuning Open Video Generators for Cinematic Scene Synthesis: A Small-Data Pipeline with LoRA and Wan2.1 I2V
Meftun Akarsu, Kerem Catay, Sedat Bin Vedat, Enes Kutay Yarkan, Ilke Senturk, Arda Sar, Dafne Eksioglu
TL;DR
The paper addresses the challenge of producing film-like cinematic scenes from open-source video diffusion models when only limited data is available. It proposes a parameter-efficient fine-tuning pipeline that injects LoRA adapters into Wan2.1 I2V-14B to learn cinematic appearance in the encoder and motion in the decoder. The adapters are trained on roughly 40 clips from El Turco (~16 minutes in total) using 33-frame windows, and the result is deployed as a self-contained 720p pipeline. Empirical results show improved cinematic fidelity and temporal stability over the base model, with a validation LPIPS of about 0.142 and higher expert ratings, alongside near-linear multi-GPU inference speedups via Fully Sharded Data Parallel (FSDP). The work provides a practical, open-source workflow that lets small teams adapt large diffusion models to production-like aesthetics on commodity hardware, broadening access to cinema-ready generative AI tooling.
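To make the dataset figures concrete: roughly 16 minutes spread over about 40 clips averages to about 24 seconds per clip. The sketch below shows one way such a clip could be cut into 33-frame training windows. The function name `window_starts`, the non-overlapping stride, and the 24 fps frame rate are illustrative assumptions; the summary above does not state them.

```python
def window_starts(num_frames: int, window: int = 33, stride: int = 33) -> list[int]:
    """Return the start index of every full `window`-frame span in a clip.

    A trailing remainder shorter than `window` is dropped rather than padded.
    """
    if num_frames < window:
        return []
    return list(range(0, num_frames - window + 1, stride))


# Hypothetical average clip: 24 seconds at an ASSUMED 24 fps.
clip_frames = 24 * 24
starts = window_starts(clip_frames)
print(f"{len(starts)} windows, first starts at frames {starts[:3]}")
```

A smaller stride (overlapping windows) would yield more training samples from the same footage at the cost of more correlated samples; which trade-off the authors chose is not stated here.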
Abstract
We present a practical pipeline for fine-tuning open-source video diffusion transformers to synthesize cinematic scenes for television and film production from small datasets. The proposed two-stage process decouples visual style learning from motion generation. In the first stage, Low-Rank Adaptation (LoRA) modules are integrated into the cross-attention layers of the Wan2.1 I2V-14B model to adapt its visual representations using a compact dataset of short clips from Ay Yapim's historical television film El Turco. This enables efficient domain transfer within hours on a single GPU. In the second stage, the fine-tuned model produces stylistically consistent keyframes that preserve costume, lighting, and color grading, which are then temporally expanded into coherent 720p sequences through the model's video decoder. We further apply lightweight parallelization and sequence partitioning strategies to accelerate inference without quality degradation. Quantitative and qualitative evaluations using FVD, CLIP-SIM, and LPIPS metrics, supported by a small expert user study, demonstrate measurable improvements in cinematic fidelity and temporal stability over the base model. The complete training and inference pipeline is released to support reproducibility and adaptation across cinematic domains.
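As a rough illustration of the LoRA mechanism named in the abstract, the following minimal sketch wraps a frozen weight matrix W with a trainable low-rank update, so the effective weight is W + (alpha / r) * B A. It is a toy pure-Python version, not the released training code; the class name `LoRALinear`, the rank, and the `alpha` value are hypothetical, since the abstract does not give them.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a low-rank update (alpha / r) * B @ A."""

    def __init__(self, W, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W                      # frozen base weight, never updated
        self.scale = alpha / rank       # standard LoRA scaling factor
        # A: small random init (r x d_in); B: zero init (d_out x r),
        # so the adapted layer starts out identical to the base layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        """Only A and B are trained: r * (d_in + d_out) values."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy 4 x 6 projection with hypothetical rank and alpha.
W = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(4)]
layer = LoRALinear(W, rank=2, alpha=4.0)
print(layer.forward([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]))
print(layer.trainable_params(), "trainable vs", 4 * 6, "frozen")
```

Because only A and B receive gradients, the number of trained values grows with r * (d_in + d_out) instead of d_in * d_out, which is what makes adapting a 14B-parameter model on a single GPU plausible.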
