Fine-Tuning Open Video Generators for Cinematic Scene Synthesis: A Small-Data Pipeline with LoRA and Wan2.1 I2V

Meftun Akarsu, Kerem Catay, Sedat Bin Vedat, Enes Kutay Yarkan, Ilke Senturk, Arda Sar, Dafne Eksioglu

TL;DR

The paper addresses the challenge of producing film-like cinematic scenes from open-source video diffusion models using limited data. It proposes a parameter-efficient fine-tuning pipeline that injects LoRA adapters into Wan2.1 I2V-14B to learn cinematic appearance in the encoder and motion in the decoder. The adapters are trained on roughly 40 El Turco clips (~16 minutes) using 33-frame windows, and the result is deployed as a self-contained 720p pipeline. Empirical results show improved cinematic fidelity and temporal stability over the base model, with a validation LPIPS of around 0.142 and higher expert ratings, alongside near-linear speedups in multi-GPU inference via FSDP. The work provides a practical, open-source workflow that lets small teams adapt large diffusion models to production-like aesthetics on commodity hardware, broadening access to cinema-ready generative AI tooling.

Abstract

We present a practical pipeline for fine-tuning open-source video diffusion transformers to synthesize cinematic scenes for television and film production from small datasets. The proposed two-stage process decouples visual style learning from motion generation. In the first stage, Low-Rank Adaptation (LoRA) modules are integrated into the cross-attention layers of the Wan2.1 I2V-14B model to adapt its visual representations using a compact dataset of short clips from Ay Yapim's historical television film El Turco. This enables efficient domain transfer within hours on a single GPU. In the second stage, the fine-tuned model produces stylistically consistent keyframes that preserve costume, lighting, and color grading, which are then temporally expanded into coherent 720p sequences through the model's video decoder. We further apply lightweight parallelization and sequence partitioning strategies to accelerate inference without quality degradation. Quantitative and qualitative evaluations using FVD, CLIP-SIM, and LPIPS metrics, supported by a small expert user study, demonstrate measurable improvements in cinematic fidelity and temporal stability over the base model. The complete training and inference pipeline is released to support reproducibility and adaptation across cinematic domains.
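The LoRA adaptation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `LoRALinear`, the rank, and the `alpha` scaling value are hypothetical choices made for illustration, and a plain-Python matrix-vector product stands in for a real attention projection. The idea it shows is the standard LoRA formulation, in which a frozen weight `W` is augmented with a trainable low-rank update `(alpha / r) * B @ A`, with `B` initialized to zero so the adapted layer starts out identical to the base layer.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration only; rank and alpha are assumed values,
    not settings reported by the paper.
    """

    def __init__(self, W, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen base weight, never updated during fine-tuning
        # A starts with small random values, B starts at zero, so the
        # low-rank update contributes nothing before training begins.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # W x
        delta = matvec(self.B, matvec(self.A, x))   # B (A x)
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        """Count only the adapter parameters (A and B), not W."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Usage: a 4x4 identity "projection" adapted with rank 2.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=2, alpha=4.0)
output = layer.forward([1.0, 2.0, 3.0, 4.0])
```

Because only `A` and `B` are trained, the number of trainable parameters grows with the rank rather than with the full size of the weight matrix, which is what makes adapting a 14B-parameter model on a single GPU plausible.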

Paper Structure

This paper contains 23 sections, 8 equations, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: Cinematic Scene Synthesis from El Turco. Our LoRA-enhanced Wan2.1 I2V model generates temporally coherent battlefield sequences preserving costume detail, atmospheric lighting, and historical authenticity. The fine-tuned model maintains chainmail texture, helmet geometry, and fog diffusion across frames while ensuring stable camera behavior typical of cinematic production.
  • Figure 2: Comprehensive Visual Results from El Turco Fine-Tuning. Generated sequences demonstrating temporal coherence and stylistic consistency across diverse scene compositions, camera angles, and lighting conditions. The figure presents 24 frames across 8 sequential rows, illustrating the model's capability to maintain cinematic quality throughout extended sequences. Each pair of rows shows a distinct type of scene or camera angle: close-up helmet details (rows 1-2), wide battlefield formations with atmospheric lighting (rows 3-4), dramatic single-subject shots (rows 5-6), and ensemble compositions with historical armor detail (rows 7-8). All sequences were generated at 720p (1280×720) with 30 denoising steps and CFG scale 3.8, demonstrating the model's internalization of El Turco's complete visual grammar while maintaining production-quality cinematography and historical authenticity.
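The Figure 2 caption cites a CFG scale of 3.8. As a rough sketch of what that number controls, the snippet below applies the commonly used classifier-free guidance rule, which extrapolates from the unconditional noise prediction toward the conditional one. It assumes the standard formulation; the function name `cfg_combine` is hypothetical, the inputs are toy lists rather than real model outputs, and the paper's pipeline may combine predictions differently.

```python
def cfg_combine(eps_uncond, eps_cond, scale=3.8):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond).

    eps_uncond and eps_cond are toy stand-ins for the model's noise
    predictions without and with conditioning. scale=3.8 mirrors the
    value quoted in the Figure 2 caption.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Usage: where the two predictions agree, guidance changes nothing;
# where they differ, the difference is amplified by the scale.
guided = cfg_combine([0.0, 1.0], [1.0, 1.0])
```

A scale of 1 reproduces the conditional prediction unchanged, and larger values push each denoising step harder toward the conditioning signal.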