ExpressEdit: Video Editing with Natural Language and Sketching

Bekzat Tilekbay; Saelyne Yang; Michal Lewkowicz; Alex Suryapranata; Juho Kim

Abstract

Informational videos serve as a crucial source for explaining conceptual and procedural knowledge to novices and experts alike. When producing informational videos, editors overlay text or images and trim footage to improve video quality and make the videos more engaging. However, video editing can be difficult and time-consuming, especially for novice video editors, who often struggle with expressing and implementing their editing ideas. To address this challenge, we first explored how multimodality, specifically natural language (NL) and sketching, which are natural modalities humans use for expression, can be utilized to support video editors in expressing video editing ideas. We gathered 176 multimodal expressions of editing commands from 10 video editors, which revealed the patterns of use of NL and sketching in describing edit intents. Based on the findings, we present ExpressEdit, a system that enables editing videos via NL text and sketching on the video frame. Powered by an LLM and vision models, the system interprets (1) temporal, (2) spatial, and (3) operational references in an NL command and spatial references from sketching. The system implements the interpreted edits, which the user can then iterate on. An observational study (N=10) showed that ExpressEdit enhanced the ability of novice video editors to express and implement their edit ideas. By generating edits from users' multimodal edit commands and supporting iteration on those commands, the system allowed participants to perform edits more efficiently and generate more ideas. This work offers insights into the design of future multimodal interfaces and AI-based pipelines for video editing.
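
As a rough illustration of the three reference types named in the abstract, the sketch below groups phrases of an NL command into temporal, spatial, and operational references and carries an optional sketched region alongside them. Every class, field, and function name here is hypothetical and is not taken from ExpressEdit. The labeling step, which the abstract attributes to an LLM, is replaced by hand-supplied labels so the example stays self-contained.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical label set mirroring the three reference types in the abstract.
REFERENCE_KINDS = ("temporal", "spatial", "operation")


@dataclass
class SketchBox:
    """Assumed representation of a sketch: a rectangle on the frame, in pixels."""
    x: float
    y: float
    width: float
    height: float


@dataclass
class EditCommand:
    """An NL command plus an optional sketch drawn on the video frame."""
    text: str
    sketch: Optional[SketchBox] = None


@dataclass
class InterpretedEdit:
    """Phrases of the command grouped by the kind of reference they carry."""
    temporal: List[str] = field(default_factory=list)
    spatial: List[str] = field(default_factory=list)
    operation: List[str] = field(default_factory=list)
    sketch: Optional[SketchBox] = None


def group_references(command: EditCommand,
                     labeled: List[Tuple[str, str]]) -> InterpretedEdit:
    """Group (phrase, kind) pairs into an InterpretedEdit.

    The labels are supplied by the caller; in a real pipeline they would
    come from a model rather than being written by hand.
    """
    edit = InterpretedEdit(sketch=command.sketch)
    for phrase, kind in labeled:
        if kind not in REFERENCE_KINDS:
            raise ValueError(f"unknown reference kind: {kind!r}")
        if phrase not in command.text:
            raise ValueError(f"phrase not found in command: {phrase!r}")
        getattr(edit, kind).append(phrase)
    return edit
```

For example, the command "add a title at the top left when he starts cooking" could be labeled as one operational phrase ("add a title"), one spatial phrase ("at the top left"), and one temporal phrase ("when he starts cooking"), with the sketched rectangle kept as an additional spatial reference.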

Paper Structure (63 sections, 4 figures, 12 tables)

This paper contains 63 sections, 4 figures, 12 tables.

Introduction
Related Work
Video Editing Systems
Multimodal Interaction
Formative Study
Participants
Study Materials
Procedure
Findings and Analysis
Expressing edit commands with multi-modalities
Participants consistently referenced moments in the video with NL text.
Participants used both NL text and sketching on top of the frame to reference the spatial location of the edits.
Participants used NL text to refer to edit operations and their parameters.
Participants frequently iterated on their edit commands to make them clearer.
Design Goals
...and 48 more sections

Figures (4)

Figure 1: With ExpressEdit, the user can (a) input an edit command using natural language in the edit description box and (b) optionally specify the location using the sketch function. The system analyzes the request and (c) shows the parts of the NL prompt that correspond to the user's intended edit operation, parameters, spatial location, and temporal location. (d) The editor canvas shows the preview of the edits and allows clicking and dragging. (e) Users can also manually adjust the resulting edit operation as well as its given parameters. (f) Users can navigate through the edit suggestions and accept or decline them, (g) as well as quickly navigate through the video timeline. (h, i) The timeline and transcript show the temporal location of applied edits and edit suggestions.
Figure 2: The offline component of the pipeline pre-processes the video and extracts textual and visual metadata.
Figure 3: The online component of the pipeline uses GPT-4 and CLIP to interpret NL text and sketching edit commands.
Figure 4: The pie chart shows the distribution of time participants spent (1) ideating about what edits to implement, (2) describing their edit requests, (3) examining the suggested edits returned by the system, and (4) manually editing the video. The time spent in each stage of the editing process is derived from user interaction logs.
