AdaFlow: Efficient Long Video Editing via Adaptive Attention Slimming And Keyframe Selection

Shuheng Zhang; Yuqi Liu; Hongbo Zhou; Jun Peng; Yiyi Zhou; Xiaoshuai Sun; Rongrong Ji

TL;DR

AdaFlow tackles the memory bottleneck in text-driven long video editing with two training-free mechanisms: Adaptive Keyframe Selection (AKS) and Adaptive Attention Slimming (AAS). AKS partitions the video into clips and selects representative keyframes, while AAS prunes the $KV$ sequence in Extended Self-Attention so that more keyframes, and therefore longer videos, can be edited jointly. Latent propagation then maintains frame-to-frame continuity using precomputed token correspondences. The authors validate AdaFlow on LongV-EVAL, a new 75-video benchmark with high-quality annotations, and show that it edits sequences of more than $1k$ frames in a single inference on one A800 GPU, outperforming several baselines in efficiency and quality. The work offers a practical, resource-efficient approach to long video editing and provides a benchmark for future evaluation.
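
To make the "partition into clips, then pick a representative keyframe" idea concrete, here is a minimal sketch. It is not the authors' AKS procedure: the per-frame feature vectors, the cosine-similarity threshold `thresh`, and the middle-frame rule are all assumptions chosen for illustration, and the function names are hypothetical.

```python
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def partition_clips(feats: List[Sequence[float]], thresh: float = 0.9) -> List[List[int]]:
    """Group consecutive frames into clips.

    Assumption: a new clip starts whenever a frame's similarity to the
    first frame of the current clip drops below `thresh`.
    """
    clips: List[List[int]] = []
    for i, feat in enumerate(feats):
        if clips and cosine(feats[clips[-1][0]], feat) >= thresh:
            clips[-1].append(i)
        else:
            clips.append([i])
    return clips


def select_keyframes(feats: List[Sequence[float]], thresh: float = 0.9) -> List[int]:
    """Return one keyframe index per clip (assumption: the middle frame)."""
    return [clip[len(clip) // 2] for clip in partition_clips(feats, thresh)]


# Toy per-frame features: two frames of one scene, then three of another.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0], [0.0, 1.0]]
keyframes = select_keyframes(frames)
```

In this toy input the content changes once, so the sketch yields two clips and one keyframe from each. A content-aware rule like this spends keyframes where the video actually changes, whereas uniform sampling spends them at fixed intervals.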

Abstract

Despite great progress, text-driven long video editing is still notoriously challenging, mainly due to excessive memory overhead. Although recent efforts have simplified this task into a two-step process of keyframe translation and interpolation generation, the token-wise keyframe translation still limits the maximum video length. In this paper, we propose a novel and training-free approach towards efficient and effective long video editing, termed AdaFlow. We first reveal that not all tokens of video frames hold equal importance for keyframe translation, based on which we propose an Adaptive Attention Slimming scheme for AdaFlow to squeeze the $KV$ sequence, thus increasing the number of keyframes for translation by an order of magnitude. In addition, an Adaptive Keyframe Selection scheme is introduced to select the representative frames for joint editing, further improving generation quality. With these designs, AdaFlow achieves high-quality editing of minutes-long videos in one inference, i.e., more than $1k$ frames on one A800 GPU, which is about ten times longer than the compared methods, e.g., TokenFlow. To validate AdaFlow, we also build a new benchmark for long video editing with high-quality annotations, termed LongV-EVAL. Our code is released at: https://github.com/jidantang55/AdaFlow.
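
The idea of squeezing the $KV$ sequence can be sketched as follows. This is an illustrative sketch only, not the released AdaFlow code: the per-token importance `scores` are taken as a given input (how AdaFlow computes them is not described here), and `slim_kv`, `attention`, and `keep_ratio` are hypothetical names.

```python
import math
from typing import List, Tuple

Vector = List[float]


def slim_kv(keys: List[Vector], values: List[Vector],
            scores: List[float], keep_ratio: float) -> Tuple[List[Vector], List[Vector]]:
    """Keep the top-scoring fraction of key/value tokens, in their original order."""
    n_keep = max(1, int(len(keys) * keep_ratio))
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:n_keep]
    kept = sorted(top)  # restore original token order
    return [keys[i] for i in kept], [values[i] for i in kept]


def attention(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Single-query scaled dot-product attention over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    max_logit = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - max_logit) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


# Four KV tokens gathered from several keyframes; importance scores are assumed given.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[1.0], [2.0], [3.0], [4.0]]
scores = [0.9, 0.1, 0.8, 0.2]

slim_keys, slim_values = slim_kv(keys, values, scores, keep_ratio=0.5)
output = attention([1.0, 0.0], slim_keys, slim_values)
```

Each query attends to every retained key, so attention cost grows with the length of the $KV$ sequence. Keeping a fraction `keep_ratio` of the tokens shrinks that cost by roughly the same fraction, which is the kind of saving that leaves room for more keyframes within a fixed memory budget.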
