EditIQ: Automated Cinematic Editing of Static Wide-Angle Videos via Dialogue Interpretation and Saliency Cues

Rohit Girmaji, Bhav Beri, Ramanathan Subramanian, Vineet Gandhi

TL;DR

EditIQ tackles automated cinematic editing of static wide-angle footage by combining LLM-based dialogue understanding with a vision-based saliency model to score candidate shots. It generates a set of virtual rushes, then selects an optimal edit path via a DP-based energy minimization that enforces cinematic constraints such as overlap, misframing, rhythm, and transitions. The approach yields strong gains over baselines and results competitive with expert human edits on the BBC-OSD dataset and theatre footage, demonstrating potential to reduce production costs while preserving narrative and emotional content. Limitations include limited real-time applicability and the need for human editors for final polish; nonetheless, EditIQ offers a scalable assistive tool for automated, dialogue-informed, visually aware editing of large-stage performances.

Abstract

We present EditIQ, a completely automated framework for cinematically editing scenes captured via a stationary, large field-of-view and high-resolution camera. From the static camera feed, EditIQ initially generates multiple virtual feeds, emulating a team of cameramen. These virtual camera shots, termed rushes, are subsequently assembled using an automated editing algorithm, whose objective is to present the viewer with the most vivid scene content. To understand key scene elements and guide the editing process, we employ a two-pronged approach: (1) a large language model (LLM)-based dialogue understanding module to analyze conversational flow, coupled with (2) visual saliency prediction to identify meaningful scene elements and camera shots therefrom. We then formulate cinematic video editing as an energy minimization problem over shot selection, where cinematic constraints determine shot choices, transitions, and continuity. EditIQ synthesizes an aesthetically and visually compelling representation of the original narrative while maintaining cinematic coherence and a smooth viewing experience. The efficacy of EditIQ against competing baselines is demonstrated via a psychophysical study involving twenty participants on the BBC Old School dataset plus eleven theatre performance videos. Video samples from EditIQ can be found at https://editiq-ave.github.io/.
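To make the phrase "energy minimization problem over shot selection" concrete, the following is a minimal sketch of a Viterbi-style dynamic program that picks one shot per time step. It is an illustration under simplifying assumptions, not the formulation used by EditIQ: the function name `select_shots`, the per-step cost matrix `unary`, and the constant `switch_cost` are all hypothetical, and the energy here contains only a per-shot cost and a fixed cut penalty rather than the overlap, misframing, rhythm, and transition terms.

```python
from typing import List, Sequence


def select_shots(unary: Sequence[Sequence[float]], switch_cost: float) -> List[int]:
    """Pick one shot index per time step so that total energy is minimal.

    unary[t][s]  -- hypothetical cost of showing shot s at step t (lower is better),
                    e.g. derived from dialogue and saliency scores.
    switch_cost  -- hypothetical constant penalty paid whenever the shot changes.
    """
    n_steps = len(unary)
    n_shots = len(unary[0])

    # cost[s] = minimal energy of any shot sequence ending in shot s so far.
    cost = list(unary[0])
    back: List[List[int]] = []  # back[t-1][s] = best predecessor of s at step t

    for t in range(1, n_steps):
        new_cost: List[float] = []
        pointers: List[int] = []
        for s in range(n_shots):
            # Cheapest way to arrive at shot s: stay (no penalty) or cut (penalty).
            best_prev = min(
                range(n_shots),
                key=lambda p: cost[p] + (0.0 if p == s else switch_cost),
            )
            transition = 0.0 if best_prev == s else switch_cost
            new_cost.append(cost[best_prev] + transition + unary[t][s])
            pointers.append(best_prev)
        cost = new_cost
        back.append(pointers)

    # Backtrack from the cheapest final shot to recover the full sequence.
    s = min(range(n_shots), key=lambda k: cost[k])
    path = [s]
    for pointers in reversed(back):
        s = pointers[s]
        path.append(s)
    path.reverse()
    return path


if __name__ == "__main__":
    # Two candidate shots over four steps; shot 1 becomes cheaper from step 2 on.
    costs = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.3]]
    print(select_shots(costs, switch_cost=0.5))
    print(select_shots(costs, switch_cost=5.0))
```

In this toy setup, a small cut penalty lets the sequence follow the cheaper shot, while a large penalty keeps the edit on a single shot. Richer constraints would enter as additional terms in the per-step and transition costs.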

Paper Structure

This paper contains 45 sections, 10 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: EditIQ Pipeline: This fully automated pipeline takes input in the form of video and face crops + IDs and outputs the completely edited video. The various parts of the pipeline are shown in the figure, with each step operating on the outputs of the previous ones.
  • Figure 2: Dialogue understanding module to get Contextual Potential from LLM for different shots based on the transcript of a scene. The post-processing in the above figure performs mapping between the LLM response and word level timestamps (from pre-processing) to get the cut locations.
  • Figure 3: Saliency potential of different single-order shots for two frames in a theatre video (potential value is shown along with the actor shot). Green arrow indicates the speaker, if any.
  • Figure 4: User Study Evaluation: Bar plots denoting mean user ratings for the different editing methodologies across four evaluation attributes for the (top row) BBC-OSD and (bottom row) Theatre recordings. Error bars denote unit standard deviation. Best viewed in color and under zoom.
  • Figure 5: Comparison of our modified saliency prediction model with the state-of-the-art ViNet model [vinet]. Our model captures the essence of the whole scene and performs joint attention to capture interactions. It focuses on the key actor, whereas ViNet is limited to head movements and marks all faces as salient.