Looking Backward: Streaming Video-to-Video Translation with Feature Banks

Feng Liang, Akio Kodaira, Chenfeng Xu, Masayoshi Tomizuka, Kurt Keutzer, Diana Marculescu

TL;DR

StreamV2V tackles real-time streaming video-to-video translation by introducing a backward-looking feature bank that connects the current frame to past frames, enabling prompt-driven edits without training while keeping memory constant, $O(1)$, with respect to video length. It extends diffusion-based generation with two training-free mechanisms: Extended self-Attention (EA), which adds banked keys and values to the self-attention of the current frame, and explicit Feature Fusion (FF), which reuses similar past features to stabilize details. A Dynamic Merging (DyMe) strategy keeps the bank compact. The method runs in real time (up to 20 FPS on a single A100 GPU) and shows favorable temporal-consistency metrics and user-preference results against streaming baselines, while remaining a drop-in add-on to existing image diffusion models. These contributions enable scalable, long-form V2V translation for applications such as webcam translation and iterative drawing, with practical impact on real-time creative editing and film workflows.
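To make the constant-memory claim concrete, here is a minimal sketch of a feature bank whose size stays bounded however many frames arrive. The merge rule (pool stored and new features, then average the most similar pair until the pool fits) is a hypothetical stand-in chosen for illustration; the summary above does not specify how DyMe actually merges. The names `merge_update` and `capacity` are likewise assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)

def merge_update(bank, new_feats, capacity):
    """Hypothetical merge rule (not the paper's DyMe): pool stored and new
    features, then average the most similar pair until the pool holds at
    most `capacity` vectors."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    pool = [list(f) for f in bank + new_feats]
    while len(pool) > capacity:
        best = None
        for i in range(len(pool)):
            for j in range(i + 1, len(pool)):
                sim = cosine(pool[i], pool[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = [(x + y) / 2 for x, y in zip(pool[i], pool[j])]
        pool.pop(j)       # j > i, so index i stays valid after the pop
        pool[i] = merged
    return pool

# Stream 50 "frames" with two features each; the bank never exceeds 4 entries.
bank = []
for t in range(50):
    bank = merge_update(bank, [[1.0, float(t)], [float(t), 1.0]], capacity=4)
```

A plain first-in-first-out queue would also bound memory; the point of merging instead is that older information is blended into the bank rather than discarded outright.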

Abstract

This paper introduces StreamV2V, a diffusion model that achieves real-time streaming video-to-video (V2V) translation with user prompts. Unlike prior V2V methods using batches to process limited frames, we opt to process frames in a streaming fashion, to support unlimited frames. At the heart of StreamV2V lies a backward-looking principle that relates the present to the past. This is realized by maintaining a feature bank, which archives information from past frames. For incoming frames, StreamV2V extends self-attention to include banked keys and values and directly fuses similar past features into the output. The feature bank is continually updated by merging stored and new features, making it compact but informative. StreamV2V stands out for its adaptability and efficiency, seamlessly integrating with image diffusion models without fine-tuning. It can run 20 FPS on one A100 GPU, being 15x, 46x, 108x, and 158x faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively. Quantitative metrics and user studies confirm StreamV2V's exceptional ability to maintain temporal consistency.
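The extension of self-attention to banked keys and values can be sketched in a few lines. This is an illustrative toy in pure Python, assuming single-head scaled dot-product attention over lists of vectors; the function names are hypothetical, and an actual implementation would run on tensors inside the attention layers of the diffusion model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs

def extended_self_attention(q, k, v, k_bank, v_bank):
    """Queries from the current frame attend over the current keys/values
    concatenated with the banked keys/values from past frames."""
    return attention(q, k + k_bank, v + v_bank)

# One current token whose query matches a banked key far better than its own.
out = extended_self_attention(
    q=[[1.0, 0.0]], k=[[0.0, 1.0]], v=[[0.0, 0.0]],
    k_bank=[[10.0, 0.0]], v_bank=[[1.0, 1.0]],
)
```

Because only the keys and values are extended, the output keeps one vector per current-frame token, which is why such a change can slot into an existing attention layer without retraining.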

Paper Structure

This paper contains 40 sections, 2 equations, 18 figures, 7 tables.

Figures (18)

  • Figure 1: We present StreamV2V to support real-time video-to-video translation for streaming input. For webcam input, our StreamV2V supports face swap (e.g., to Elon Musk) and video stylization (e.g., to doodle art). Additionally, StreamV2V provides drawing rendering capabilities, enabling iterative creation. We encourage readers to check our video results in the supplementary materials.
  • Figure 2: (a) Existing V2V methods process frames in batches, restricting them to a limited number of frames. (b) Our StreamV2V framework processes frames in a streaming fashion and can operate on streaming videos in real time. (c) Batch processing requires $O(N)$ memory for the video length $N$, whereas our StreamV2V only needs $O(1)$ memory regardless of video length.
  • Figure 3: Overview of StreamV2V. Left: StreamV2V connects the current frame to the past by maintaining a feature bank that stores the intermediate transformer features. For new incoming frames, StreamV2V fetches the stored features and uses them through Extended self-Attention (EA) and direct Feature Fusion (FF). Middle: EA concatenates the stored keys $K_{fb}$ and values $V_{fb}$ directly to those of the current frame in the self-attention computation (Section \ref{sec:cross_frame_attention}). Right: Operating on the output of transformer blocks, FF first retrieves similar features in the bank via a cosine similarity matrix, and then computes a weighted sum to fuse them (Section \ref{sec:feature_fusion}). The update method of the feature bank is elaborated in Section \ref{sec:merging_bank}.
  • Figure 4: Naive queue vs. our dynamic merging (DyMe). DyMe has a more compact and informative feature bank.
  • Figure 5: Qualitative comparison with representative V2V models. Prompt is 'A pixel art of a man doing a handstand on the street'. Our method stands out in terms of prompt alignment and overall frame consistency. We highly encourage readers to refer to video comparisons in our supplementary videos.
  • ...and 13 more figures
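
The Feature Fusion step described in the Figure 3 caption (cosine-similarity retrieval followed by a weighted sum) can be sketched as below. The `threshold` and `weight` parameters, their default values, and the choice to fuse only the single best match are assumptions made for illustration; the caption does not give the exact rule.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)

def feature_fusion(current, bank, threshold=0.9, weight=0.5):
    """For each current feature, look up its most similar banked feature and
    blend the two with a weighted sum. `threshold` and `weight` are
    hypothetical knobs, not values taken from the paper."""
    fused = []
    for feat in current:
        sims = [cosine(feat, stored) for stored in bank]  # one similarity-matrix row
        if sims and max(sims) >= threshold:
            match = bank[sims.index(max(sims))]
            fused.append([(1 - weight) * x + weight * y
                          for x, y in zip(feat, match)])
        else:
            fused.append(list(feat))  # no close match: keep the feature as is
    return fused

# The first feature points the same way as the banked one and is blended;
# the second is orthogonal to it and passes through unchanged.
fused = feature_fusion([[1.0, 0.0], [0.0, 1.0]], bank=[[2.0, 0.0]])
```

Gating on similarity means only features that already resemble something in the bank are pulled toward the past, which matches the stated goal of stabilizing details rather than overwriting new content.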