AdaFlow: Efficient Long Video Editing via Adaptive Attention Slimming And Keyframe Selection

Shuheng Zhang, Yuqi Liu, Hongbo Zhou, Jun Peng, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji

TL;DR

AdaFlow tackles the memory bottleneck in text-driven long video editing with two training-free mechanisms: Adaptive Keyframe Selection (AKS) and Adaptive Attention Slimming (AAS). AKS partitions the video into clips according to its content and selects representative keyframes. AAS prunes the KV sequence in Extended Self-Attention so that more keyframes can be translated jointly, which in turn allows longer edits. Latent propagation then carries the edits to the remaining frames via precomputed token correspondences, preserving frame-to-frame continuity. The authors validate on LongV-EVAL, a new 75-video benchmark with high-quality annotations, and show that AdaFlow edits sequences of over $1k$ frames in one inference on an A800 GPU, outperforming several baselines in efficiency and quality. The work offers a practical, resource-efficient approach to long video editing and provides a benchmark for future evaluation.
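To make the clip-then-keyframe idea concrete, here is a minimal sketch of content-based clip partitioning. It is not the authors' AKS implementation: the per-frame feature vectors, the L1 change measure, the `threshold` parameter, and the middle-frame selection rule are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def split_into_clips(features: Sequence[Sequence[float]],
                     threshold: float) -> List[List[int]]:
    """Group consecutive frame indices into clips.

    Assumption: a new clip starts whenever the L1 distance between the
    features of neighbouring frames exceeds `threshold`.
    """
    if not features:
        return []
    clips = [[0]]
    for i in range(1, len(features)):
        change = sum(abs(a - b) for a, b in zip(features[i], features[i - 1]))
        if change > threshold:
            clips.append([i])      # abrupt content change: open a new clip
        else:
            clips[-1].append(i)    # similar content: extend the current clip
    return clips


def pick_keyframes(clips: List[List[int]]) -> List[int]:
    """Assumption: the middle frame of each clip serves as its keyframe."""
    return [clip[len(clip) // 2] for clip in clips]


if __name__ == "__main__":
    # Toy one-dimensional features with two abrupt changes.
    toy = [[0.0], [0.1], [0.2], [5.0], [5.1], [9.0]]
    clips = split_into_clips(toy, threshold=1.0)
    print(clips, pick_keyframes(clips))
```

With a rule of this kind, rapidly changing segments produce more clips and therefore more keyframes, while static segments are covered by few.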

Abstract

Despite great progress, text-driven long video editing is still notoriously challenging, mainly due to excessive memory overhead. Although recent efforts have simplified this task into a two-step process of keyframe translation and interpolation generation, the token-wise keyframe translation still limits the maximum video length. In this paper, we propose a novel and training-free approach towards efficient and effective long video editing, termed AdaFlow. We first reveal that not all tokens of video frames hold equal importance for keyframe translation, based on which we propose an Adaptive Attention Slimming scheme for AdaFlow to squeeze the $KV$ sequence, thus increasing the number of keyframes for translation by an order of magnitude. In addition, an Adaptive Keyframe Selection scheme is employed to select the representative frames for joint editing, further improving generation quality. With these innovative designs, AdaFlow achieves high-quality long video editing of minutes in one inference, i.e., more than 1$k$ frames on one A800 GPU, which is about ten times longer than the compared methods, e.g., TokenFlow. To validate AdaFlow, we also build a new benchmark for long video editing with high-quality annotations, termed LongV-EVAL. Our code is released at: https://github.com/jidantang55/AdaFlow.
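The idea of squeezing the $KV$ sequence can be illustrated with a small sketch. It assumes that a per-token importance score is already available. How AdaFlow computes such a score is not stated in this summary, so `importance` and `keep_ratio` are hypothetical inputs and the code is an illustration rather than the released implementation.

```python
import math
from typing import List, Tuple

Vector = List[float]


def slim_kv(keys: List[Vector], values: List[Vector],
            importance: List[float],
            keep_ratio: float) -> Tuple[List[Vector], List[Vector]]:
    """Keep only the most important key/value tokens (hypothetical scoring)."""
    n_keep = max(1, int(len(keys) * keep_ratio))
    ranked = sorted(range(len(keys)), key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore the original token order
    return [keys[i] for i in kept], [values[i] for i in kept]


def attention(query: Vector, keys: List[Vector],
              values: List[Vector]) -> Vector:
    """Scaled dot-product attention for a single query token."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    values = [[1.0], [2.0], [3.0], [4.0]]
    importance = [0.9, 0.1, 0.8, 0.2]  # hypothetical scores
    slim_keys, slim_values = slim_kv(keys, values, importance, keep_ratio=0.5)
    print(attention([0.0, 0.0], slim_keys, slim_values))
```

Because each query attends to the concatenated keys and values of all keyframes, the attention cost grows with the $KV$ length. Keeping a fraction `keep_ratio` of the tokens shrinks that length proportionally, which is what frees memory for additional keyframes.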

Paper Structure

This paper contains 19 sections, 9 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: The proposed AdaFlow can support the text-driven video editing of more than 1$k$ frames in one inference. Meanwhile, AdaFlow can adaptively select the representative frames for keyframe translation, ensuring the continuity and quality of long video editing.
  • Figure 2: The framework of the proposed AdaFlow. (a) The pipeline of AdaFlow for long video editing. Given a source video and the text editing prompt, AdaFlow first applies Adaptive Keyframe Selection (AKS) (b) to adaptively divide the video into clips according to its content and then sample frames for keyframe translation. Afterwards, Adaptive Attention Slimming (AAS) (c) is applied to reduce the redundant tokens in Extended Self-Attention for keyframe translation, thereby increasing the number of frames edited. Finally, the editing information of the keyframes is propagated throughout the entire video.
  • Figure 3: Comparisons of AdaFlow with a set of advanced video editing methods (a) and ablation study for Adaptive Keyframe Selection (AKS) (b). (a) The red boxes mark failed edits by the other methods, e.g., unintended changes to objects or background, or inconsistency between frames. Compared with the other methods, AdaFlow not only processes videos of up to 1$k$ frames in one inference but also preserves the quality and continuity of the edited videos. (b) The ablation shows that AKS can capture the abrupt changes in the videos to ensure the editing quality, e.g., the appearance of the car (above), or the girl dancing (below). Without AKS, the rapidly changing parts of the video are often blurry.
  • Figure 4: Examples of results for dataset annotation. Each source video is accompanied by three different prompts that focus on three aspects: foreground, background, and style.
  • Figure 5: Additional Qualitative Results. Our method supports a wide variety of text-driven video edits and maintains high editing quality and temporal consistency even for videos exceeding a thousand frames.
  • ...and 4 more figures