SAGA: Source Attribution of Generative AI Videos

Rohit Kundu, Vishal Mohanty, Hao Xiong, Shan Jia, Athula Balachandran, Amit K. Roy-Chowdhury

TL;DR

SAGA tackles the urgent problem of attributing AI-generated videos to their exact generative source, moving beyond binary real/fake detection. It presents a data-efficient, two-stage approach that builds a video transformer on top of rich vision foundation features, first mastering binary classification and then adapting to multi-class attribution with a contrastive objective that employs hard negative mining. The framework supports five attribution levels (BIN-L, TASK-L, SD-L, TEAM-L, GEN-L) and introduces Temporal Attention Signatures (T-Sigs) for interpretable, temporal fingerprints of generators. Empirically, SAGA achieves state-of-the-art results across in-domain and cross-domain settings on 19 generators, with only $0.5\%$ of source-labeled data needed for fine-grained attribution, thereby enabling practical forensic and regulatory use and setting a new benchmark for AI video provenance.

Abstract

The proliferation of generative AI has led to hyper-realistic synthetic videos, escalating misuse risks and outstripping binary real/fake detectors. We introduce SAGA (Source Attribution of Generative AI videos), the first comprehensive framework to address the urgent need for AI-generated video source attribution at a large scale. Unlike traditional detection, SAGA identifies the specific generative model used. It uniquely provides multi-granular attribution across five levels: authenticity, generation task (e.g., T2V/I2V), model version, development team, and the precise generator, offering far richer forensic insights. Our novel video transformer architecture, leveraging features from a robust vision foundation model, effectively captures spatio-temporal artifacts. Critically, we introduce a data-efficient pretrain-and-attribute strategy, enabling SAGA to achieve state-of-the-art attribution using only 0.5% of source-labeled data per class, matching fully supervised performance. Furthermore, we propose Temporal Attention Signatures (T-Sigs), a novel interpretability method that visualizes learned temporal differences, offering the first explanation for why different video generators are distinguishable. Extensive experiments on public datasets, including cross-domain scenarios, demonstrate that SAGA sets a new benchmark for synthetic video provenance, providing crucial, interpretable insights for forensic and regulatory applications.
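
To make the "0.5% of source-labeled data per class" setting concrete, the following is a minimal sketch of per-class subsampling. It is an illustration under stated assumptions, not the authors' data pipeline: the function name `sample_labeled_subset`, the label strings, the fixed seed, and the choice to keep at least one example per class are all hypothetical.

```python
import random
from collections import defaultdict


def sample_labeled_subset(labels, fraction=0.005, seed=0):
    """Return sorted indices of a random per-class subset.

    labels:   one class label per video (hypothetical input format).
    fraction: share of each class to keep; 0.005 corresponds to 0.5%.
    At least one example is kept per class (an assumption of this sketch).
    """
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Hypothetical class sizes, used only to exercise the function.
labels = ["gen_a"] * 1000 + ["gen_b"] * 2000 + ["real"] * 400
subset = sample_labeled_subset(labels)
counts = {c: sum(1 for i in subset if labels[i] == c) for c in set(labels)}
```

Sampling within each class, rather than over the pooled dataset, keeps small classes represented in the labeled subset.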

Paper Structure

This paper contains 7 sections, 8 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: SAGA: Data-Efficient & Interpretable AI Video Source Attribution. (a) Temporal Attention Signatures (T-Sigs): SAGA pioneers AI video source attribution. Our novel T-Sigs provide interpretability, showing unique fingerprints for Real, Seen, and even Unseen generators. (b) Feature Separability: t-SNE visualization of learned features demonstrates clear generator clusters. (c) Multi-Granular Performance & Data Efficiency: SAGA excels across 5 attribution levels. The radar chart shows that our 2-stage training method with the Hard Negative Mining (HNM) objective, trained on only 0.5% labeled data, matches fully supervised performance and surpasses baselines.
  • Figure 2: Overall framework of SAGA with a two-stage training approach. In Stage-1, each video $x_k$ with real/fake labels is processed through a frozen foundational vision encoder to extract image-level features $z_m$, which are stacked in temporal order to form the video representation $\zeta_k$. Positional encoding is added, and the sequence is passed through our video transformer architecture $\theta$ (Sec. \ref{subsec:transformer}) to obtain $\phi_k$. The classifier $\beta_1$ maps $\phi_k$ to real or fake classes using a cross-entropy loss ($\mathcal{L}_{CE}$). In Stage-2, the pretrained video transformer is adapted for attribution into $n_c$ classes ($n_c$ defined by the attribution task; see Sec. \ref{sec:method}) using only 0.5% of source-labeled data. Stage-2 incorporates an additional hard negative mining objective ($\mathcal{L}_{\text{HNM}}$, Sec. \ref{subsec:contrastive}) along with $\mathcal{L}_{CE}$ for the attribution task.
  • Figure 3: HNM enables better separation boundaries between classes, whereas semi-HNM excludes these samples from the loss.
  • Figure 4: t-SNE visualization of SAGA's learned representations trained on the TASK-L, BIN-L, SD-L and TEAM-L attribution tasks, respectively. Even when supervised at coarser levels, SAGA distinctly clusters individual generators, revealing strong fine-grained discriminative ability.
  • Figure 5: t-SNE visualization of SAGA on the GEN-L attribution task with different loss functions.
  • ...and 1 more figure
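
The contrast between HNM and semi-HNM in the Figure 3 caption can be sketched with a generic triplet-style loss. This summary does not give the exact form of $\mathcal{L}_{\text{HNM}}$, so the code below is an assumption: it uses the standard batch-hard and semi-hard mining definitions from the triplet-loss literature, with cosine distance and a hypothetical `margin` of 0.2. The function names and the toy embeddings are illustrative only.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mined_triplet_loss(embeddings, labels, margin=0.2, semi_hard=False):
    """Mean triplet loss over anchors with batch-level negative mining.

    semi_hard=False: use the closest negative (hard negative mining).
    semi_hard=True:  keep only negatives farther than the hardest positive
                     but still inside the margin, so the hardest negatives
                     are excluded from the loss.
    """
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        pos, neg = [], []
        for j, (other, other_label) in enumerate(zip(embeddings, labels)):
            if j == i:
                continue
            dist = cosine_distance(anchor, other)
            (pos if other_label == label else neg).append(dist)
        if not pos or not neg:
            continue  # anchor has no valid triplet in this batch
        d_pos = max(pos)  # hardest positive: farthest same-class sample
        if semi_hard:
            neg = [d for d in neg if d_pos < d < d_pos + margin]
            if not neg:
                continue  # every negative was filtered out
        d_neg = min(neg)  # closest remaining negative
        losses.append(max(0.0, d_pos - d_neg + margin))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: two samples of class "A" and one confusable sample of class "B".
embeddings = [[1.0, 0.0], [0.8, 0.6], [0.96, 0.28]]
labels = ["A", "A", "B"]
loss_hard = mined_triplet_loss(embeddings, labels)
loss_semi = mined_triplet_loss(embeddings, labels, semi_hard=True)
```

In this toy batch the class "B" sample lies closer to each "A" anchor than the other "A" sample does, so hard mining yields a positive loss while the semi-hard variant filters it out and contributes nothing.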