ESA: Energy-Based Shot Assembly Optimization for Automatic Video Editing

Yaosen Chen, Wei Wang, Tianheng Zheng, Xuming Wen, Han Yang, Yanru Zhang

TL;DR

This work tackles the challenge of automated yet artistically coherent video editing by formulating shot assembly as an energy-based optimization that integrates semantic guidance, reference-driven syntax learning, and multi-constraint scoring. The core methodology combines visual-semantic matching with LLM-generated scripts, syntax extraction from reference videos, and a unified energy function that captures shot size, motion, and semantics, optimized via a discrete Langevin-like framework augmented with Beam Search or Genetic Algorithms. Key contributions include learning a reference-based shot-size syntax prior, extending to multiple syntactic dimensions (e.g., motion) with CLIP-based semantic energy, and demonstrating that a Langevin+GA hybrid yields superior optimization performance and style fidelity. Experiments show that the proposed approach achieves higher subjective style similarity and lower transition-matrix MSE than competitive tools, enabling non-experts to produce visually and narratively coherent videos that reflect target editing styles with practical efficiency.

Abstract

Shot assembly is a crucial step in film production and video editing, involving the sequencing and arrangement of shots to construct a narrative, convey information, or evoke emotions. Traditionally, this process has been manually executed by experienced editors. While current intelligent video editing technologies can handle some automated video editing tasks, they often fail to capture the creator's unique artistic expression in shot assembly. To address this challenge, we propose an energy-based optimization method for video shot assembly. Specifically, we first perform visual-semantic matching between the script generated by a large language model and a video library to obtain subsets of candidate shots aligned with the script semantics. Next, we segment and label the shots from reference videos, extracting attributes such as shot size, camera motion, and semantics. We then employ energy-based models to learn from these attributes, scoring candidate shot sequences based on their alignment with reference styles. Finally, we achieve shot assembly optimization by combining multiple syntax rules, producing videos that align with the assembly style of the reference videos. Our method not only automates the arrangement and combination of independent shots according to specific logic, narrative requirements, or artistic styles but also learns the assembly style of reference videos, creating a coherent visual sequence or holistic visual expression. With our system, even users with no prior video editing experience can create visually compelling videos. Project page: https://sobeymil.github.io/esa.com
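As a rough illustration of the scoring idea described in the abstract, the sketch below estimates a shot-size transition prior from a labeled reference sequence and ranks candidate orderings by an energy value. The energy form (negative log-likelihood of consecutive transitions), the add-alpha smoothing, the exhaustive search, and all function names are assumptions made for illustration; the abstract does not specify the paper's actual energy function or optimizer at this level of detail.

```python
import math
from itertools import permutations

# Shot-size label set as listed in the Figure 4 caption (indices 0..4).
SHOT_SIZES = ["ELS", "LS", "MS", "CU", "ECU"]

def transition_probs(labels, n_states, alpha=1.0):
    """Row-normalized transition probabilities with add-alpha smoothing (assumed)."""
    counts = [[alpha] * n_states for _ in range(n_states)]
    for a, b in zip(labels, labels[1:]):
        counts[a][b] += 1.0
    return [[c / sum(row) for c in row] for row in counts]

def sequence_energy(labels, probs):
    """Hypothetical energy: negative log-likelihood of consecutive transitions.

    Lower energy means the ordering looks more like the reference style.
    """
    return -sum(math.log(probs[a][b]) for a, b in zip(labels, labels[1:]))

def best_ordering(candidates, probs):
    """Exhaustive search over orderings; only viable for a handful of shots."""
    return min(permutations(candidates), key=lambda seq: sequence_energy(seq, probs))

# Toy reference sequence of shot-size indices (made up for this example).
reference = [0, 1, 2, 3, 2, 1, 0, 1, 2, 3, 4]
probs = transition_probs(reference, len(SHOT_SIZES))

# Candidate shots retrieved for a script, identified here only by shot size.
candidate_shots = [3, 0, 2, 1]
best = best_ordering(candidate_shots, probs)
print([SHOT_SIZES[i] for i in best])
```

Exhaustive enumeration grows factorially with the number of shots, which is why a sampling or population-based search, such as the Langevin-like and Genetic Algorithm combination named in the TL;DR, becomes necessary for realistic sequence lengths.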

Paper Structure

This paper contains 26 sections, 18 equations, 5 figures, 6 tables, 5 algorithms.

Figures (5)

  • Figure 1: Energy-based shot assembly optimization. Given a specific theme and a library of video clips, our method employs an energy-based model to search for an optimal shot assembly that aligns with the thematic semantics, editing syntax, and user intent. Combined with additional post-production processes such as voice-overs and subtitles, this approach enables high-quality, intelligent video editing.
  • Figure 2: Overall Framework of Energy-Based Shot Assembly Optimization for Automatic Video Editing. Our approach consists of modules such as "Shot Segmentation and Label Extraction", "Visual-Semantic Matching", and "Multiple Syntaxes Joint Assembly Optimization". These modules work together to automatically generate an edited video sequence that reflects the shot scale style and camera motion style of a reference sequence.
  • Figure 3: Visual Comparison of Video Editing. We compare the results of video editing using the same "Script Text Content" and "Video Repository", and our method achieves better visual-text alignment.
  • Figure 4: Visual Comparison of the Transition Score Matrices for Shot Size and Camera Motion Syntax in the Edited Videos. S0 to S4 respectively represent the shot size attributes: Extreme Long Shot (ELS), Long Shot (LS), Medium Shot (MS), Close-Up (CU), and Extreme Close-Up (ECU). C0 to C6 respectively represent the camera motion attributes: Stable, Up, Down, Left, Right, Out, and In.
  • Figure S1: Langevin Sampling Video Assembly Optimization Process.
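
The transition matrices compared in Figure 4, and the transition-matrix MSE mentioned in the TL;DR, can be sketched as follows. This is a minimal example that assumes first-order, row-normalized transition frequencies and an element-wise mean squared error; the paper's exact normalization is not given in this section, so the function names and the normalization choice are assumptions. The label sequences are made up for illustration.

```python
# Label sets from the Figure 4 caption.
SHOT_SIZE = ["ELS", "LS", "MS", "CU", "ECU"]                            # S0..S4
CAMERA_MOTION = ["Stable", "Up", "Down", "Left", "Right", "Out", "In"]  # C0..C6

def transition_matrix(labels, n_states):
    """Row-normalized first-order transition frequencies; unseen rows stay zero."""
    m = [[0.0] * n_states for _ in range(n_states)]
    for a, b in zip(labels, labels[1:]):
        m[a][b] += 1.0
    for row in m:
        total = sum(row)
        if total > 0:
            row[:] = [v / total for v in row]
    return m

def matrix_mse(p, q):
    """Element-wise mean squared error between two equally sized matrices."""
    n = len(p) * len(p[0])
    return sum((x - y) ** 2 for rp, rq in zip(p, q) for x, y in zip(rp, rq)) / n

# Toy shot-size index sequences for a reference video and an edited video.
reference_sizes = [0, 1, 2, 3, 2, 1, 0]
edited_sizes = [0, 1, 2, 3, 4, 3, 2]

ref_m = transition_matrix(reference_sizes, len(SHOT_SIZE))
edit_m = transition_matrix(edited_sizes, len(SHOT_SIZE))
shot_mse = matrix_mse(ref_m, edit_m)
print(f"shot-size transition MSE: {shot_mse:.4f}")

# The same helper applies to camera-motion labels, giving a 7x7 matrix.
motion_m = transition_matrix([0, 4, 4, 6, 0], len(CAMERA_MOTION))
```

A lower MSE means the edited video's shot-to-shot transition pattern is closer to that of the reference, and an MSE of zero means the two matrices are identical.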