Edit3K: Universal Representation Learning for Video Editing Components

Xin Gu, Libo Zhang, Fan Chen, Longyin Wen, Yufei Wang, Tiejian Luo, Sijie Zhu

TL;DR

Edit3K introduces the first large-scale, six-type video editing component dataset and a guided embedding framework to learn universal, material-agnostic representations of editing components. The approach combines a guided spatial-temporal encoder, a guided embedding decoder, and embedding queues with a specialized contrastive loss, achieving state-of-the-art results on editing-component retrieval and transition recommendation. The work is validated through comprehensive ablations, distribution analyses, and a user study, demonstrating improved clustering of editing components and robust downstream performance. This dataset and method enable more effective editing-component understanding, supporting applications like recommendations, recognition, and generation in real-world video creation.

Abstract

This paper focuses on understanding the predominant video creation pipeline, i.e., compositional video editing with six main types of editing components: video effects, animation, transition, filter, sticker, and text. In contrast to existing visual representation learning of visual materials (i.e., images/videos), we aim to learn visual representations of the editing actions/components that are generally applied to raw materials. We start by proposing the first large-scale dataset for editing components of video creation, which covers about 3,094 editing components with 618,800 videos. Each video in our dataset is rendered from various image/video materials with a single editing component, which supports atomic visual understanding of different editing components. It can also benefit several downstream tasks, e.g., editing component recommendation and editing component recognition/retrieval. Existing visual representation methods perform poorly because it is difficult to disentangle the visual appearance of editing components from raw materials. To that end, we benchmark popular alternative solutions and propose a novel method that learns to attend to the appearance of editing components regardless of raw materials. Our method achieves favorable results on editing component retrieval/recognition compared to the alternative solutions. A user study is also conducted to show that our representations cluster visually similar editing components better than other alternatives. Furthermore, when applied to the transition recommendation task, our learned representations achieve state-of-the-art results on the AutoTransition dataset. The code and dataset are available at https://github.com/GX77/Edit3K.

Paper Structure

This paper contains 20 sections, 2 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: An overview of generic video representation learning and editing component representation learning. The embeddings from generic video representation learning are clustered based on video content, e.g., semantics, context, etc., while the ideal editing component representation should depend only on the applied editing component rather than the content of the raw materials.
  • Figure 2: Examples of 6 major types of video editing components, i.e., video effect, animation, transition, filter, sticker, and text.
  • Figure 3: An overview of the proposed method. The input video frames are fed to the spatial and temporal encoders to generate the visual features. The embedding decoder then takes the visual features as keys and values and generates the final editing component embedding using a cross-attention mechanism with one query token. All encoders and decoders are guided with guidance tokens, which are the centers of the embeddings saved in a queue. The model is optimized with the InfoNCE loss (Oord et al., 2018; He et al., 2020) across the batch, and the embedding queue provides extra negative samples for an extra loss term. Best viewed on screen with zoom-in.
  • Figure 4: An example of editing component retrieval, which includes the query video and 3,094 candidate videos. The green box indicates the ground truth for this query video. Best viewed on screen with zoom-in.
  • Figure 5: An example from the user study. The users are asked to select all the videos that are visually similar to the query video. Best viewed on screen with zoom-in.
  • ...and 4 more figures
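
The Figure 3 caption mentions an InfoNCE loss across the batch plus an extra loss term whose negatives come from the embedding queue. The following is a minimal NumPy sketch of that general idea, not the authors' implementation. It assumes cosine-similarity logits, a temperature of 0.07, positives formed by pairing two videos that share an editing component but use different raw materials, and a queue that holds only embeddings of other components. The function names (`info_nce_with_queue`, `l2_normalize`, `log_softmax`) are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce_with_queue(anchors, positives, queue, temperature=0.07):
    """Return (batch_loss, queue_loss) for a batch of embedding pairs.

    anchors, positives: arrays of shape (B, D). Row i of each is assumed to
        come from the same editing component applied to different materials.
    queue: array of shape (K, D) of stored embeddings, assumed here to belong
        to other components so that they can serve as negatives.
    temperature: assumed value, not taken from the paper.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    q = l2_normalize(queue)

    # In-batch term: the positive for row i sits on the diagonal,
    # and the other rows of the batch act as negatives.
    batch_logits = a @ p.T / temperature                      # (B, B)
    batch_loss = -np.mean(np.diag(log_softmax(batch_logits)))

    # Queue term: one positive logit per anchor, followed by K queue negatives.
    pos_logit = np.sum(a * p, axis=1, keepdims=True) / temperature  # (B, 1)
    queue_logits = a @ q.T / temperature                            # (B, K)
    logits = np.concatenate([pos_logit, queue_logits], axis=1)      # (B, 1 + K)
    queue_loss = -np.mean(log_softmax(logits)[:, 0])

    return batch_loss, queue_loss
```

In this sketch the two terms would simply be summed (possibly with a weight) to form the training objective. How the paper weights the terms, updates the queue, and derives the guidance tokens from it is not stated in the caption and is left out here.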