SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution

Chao Wang, Zijin Yang, Yaofei Wang, Yuang Qi, Weiming Zhang, Nenghai Yu, Kejiang Chen

TL;DR

This work is the first to define the "few-shot training-free generated video attribution" task and proposes SWIFT, a method tightly integrated with the temporal characteristics of video. SWIFT achieves over 90% average attribution accuracy across all evaluated models with merely 20 video samples.

Abstract

Recent advancements in video generation technologies have been significant, resulting in their widespread application across multiple domains. However, concerns have been mounting over the potential misuse of generated content. Tracing the origin of generated videos has become crucial to mitigate potential misuse and identify responsible parties. Existing video attribution methods require additional operations or the training of source attribution models, which may degrade video quality or necessitate large amounts of training samples. To address these challenges, we define for the first time the "few-shot training-free generated video attribution" task and propose SWIFT, which is tightly integrated with the temporal characteristics of the video. By leveraging the "Pixel Frames (many) to Latent Frame (one)" temporal mapping within each video chunk, SWIFT applies a fixed-length sliding window to perform two distinct reconstructions: normal and corrupted. The variation in the losses between the two reconstructions is then used as an attribution signal. We conducted an extensive evaluation on five state-of-the-art (SOTA) video generation models. Experimental results show that SWIFT achieves over 90% average attribution accuracy with merely 20 video samples across all models and even enables zero-shot attribution for HunyuanVideo, EasyAnimate, and Wan2.2. Our source code is available at https://github.com/wangchao0708/SWIFT.
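
The "many pixel frames to one latent frame" mapping and the fixed-length sliding window can be illustrated with a small indexing sketch. It is not taken from the SWIFT repository: the function names `latent_index` and `sliding_windows`, the `stride` parameter, and the simple uniform chunking (every `ratio` consecutive frames share one latent frame, with no special handling of the first frame) are all assumptions made for illustration.

```python
from typing import List, Tuple


def latent_index(frame_idx: int, ratio: int = 4) -> int:
    """Map a pixel-frame index to the latent frame of its chunk.

    Assumption: uniform chunking, i.e. frames [0, ratio) share latent 0,
    frames [ratio, 2 * ratio) share latent 1, and so on.
    """
    return frame_idx // ratio


def sliding_windows(num_frames: int, window_len: int, stride: int) -> List[Tuple[int, int]]:
    """Enumerate fixed-length windows as half-open [start, end) frame ranges."""
    if window_len > num_frames:
        return []
    return [
        (start, start + window_len)
        for start in range(0, num_frames - window_len + 1, stride)
    ]


if __name__ == "__main__":
    # With a temporal compression ratio of 4, eight frames fall into two chunks.
    print([latent_index(i) for i in range(8)])
    # Windows of 8 frames over a 12-frame clip, moved one chunk at a time.
    print(sliding_windows(num_frames=12, window_len=8, stride=4))
```

Choosing a stride equal to the compression ratio keeps every window aligned to chunk boundaries; whether SWIFT uses aligned or deliberately misaligned windows for its two reconstructions is not stated in the abstract, so the sketch leaves that choice to the caller.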

Paper Structure (17 sections, 9 equations, 3 figures, 10 tables)

Figures (3)

  • Figure 1: Comparison of existing attribution methods and SWIFT: (a) Watermark-based active attribution; (b) Training-based passive attribution; (c) SWIFT: few-shot training-free passive attribution.
  • Figure 2: The 3D VAE performs downsampling and upsampling along the temporal dimension (Temporal Compression Ratio = 4) and executes two distinct reconstructions (Normal and Corrupted).
  • Figure 3: The framework of SWIFT consists of three key modules: determination of a fixed-length sliding window, normal and corrupted reconstruction, and threshold determination. SWIFT uses the average loss ratio of overlapping frames between the two reconstructions (normal and corrupted) as the attribution signal. This signal is then compared with the threshold obtained through KDE to derive the final attribution result.
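
The last two steps described in the Figure 3 caption (an average loss ratio used as the signal, and a KDE-derived threshold) might be sketched as follows. This is a hedged illustration rather than the authors' implementation: the ratio direction (corrupted over normal), the rule "pick a low quantile of the KDE fitted to the few-shot signals", the bandwidth rule, and the names `attribution_signal`, `kde_threshold`, and `is_attributed` are all assumptions. The per-frame losses are taken as given inputs; how they are computed from the 3D VAE is outside this sketch.

```python
import numpy as np


def attribution_signal(normal_loss, corrupted_loss, eps: float = 1e-12) -> float:
    """Average per-frame loss ratio over the overlapping frames.

    Assumption: the ratio is corrupted / normal; the caption does not fix
    the direction.
    """
    normal = np.asarray(normal_loss, dtype=float)
    corrupted = np.asarray(corrupted_loss, dtype=float)
    return float(np.mean(corrupted / (normal + eps)))


def kde_threshold(signals, quantile: float = 0.05, grid_size: int = 2048) -> float:
    """Threshold as a low quantile of a Gaussian KDE fitted to few-shot signals.

    Assumption: both the quantile rule and the normal-reference bandwidth
    (1.06 * std * n^(-1/5)) are illustrative choices.
    """
    x = np.asarray(signals, dtype=float)
    bandwidth = max(1.06 * x.std(ddof=1) * x.size ** (-1 / 5), 1e-6)
    grid = np.linspace(x.min() - 4 * bandwidth, x.max() + 4 * bandwidth, grid_size)
    # Sum of Gaussian kernels centred on each sample, evaluated on the grid.
    density = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / bandwidth) ** 2).sum(axis=1)
    cdf = np.cumsum(density)
    cdf /= cdf[-1]
    return float(grid[np.searchsorted(cdf, quantile)])


def is_attributed(signal: float, threshold: float) -> bool:
    """Assumption: videos from the target model yield larger ratios."""
    return signal >= threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for signals measured on 20 few-shot videos.
    few_shot = rng.normal(loc=2.0, scale=0.2, size=20)
    tau = kde_threshold(few_shot)
    test_signal = attribution_signal([0.10, 0.12, 0.11], [0.22, 0.25, 0.21])
    print(tau, test_signal, is_attributed(test_signal, tau))
```

The few-shot signals here are synthetic numbers drawn for the demo, not results from the paper.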