mAVE: A Watermark for Joint Audio-Visual Generation Models

Luyang Si, Leyi Pan, Lijie Wen

Abstract

As Joint Audio-Visual Generation Models see widespread commercial deployment, embedding watermarks has become essential for protecting vendor copyright and ensuring content provenance. However, existing techniques suffer from an architectural mismatch by treating modalities as decoupled entities, exposing a critical Binding Vulnerability. Adversaries exploit this via Swap Attacks by replacing authentic audio with malicious deepfakes while retaining the watermarked video. Because current detectors rely on independent verification ($Video_{wm}\vee Audio_{wm}$), they incorrectly authenticate the manipulated content, falsely attributing harmful media to the original vendor and severely damaging their reputation. To address this, we propose mAVE (Manifold Audio-Visual Entanglement), the first watermarking framework natively designed for joint architectures. mAVE cryptographically binds audio and video latents at initialization without fine-tuning, defining a Legitimate Entanglement Manifold via Inverse Transform Sampling. Experiments on state-of-the-art models (LTX-2, MOVA) demonstrate that mAVE guarantees performance-losslessness and provides an exponential security bound against Swap Attacks. Achieving near-perfect binding integrity ($>99\%$), mAVE offers a robust cryptographic defense for vendor copyright.

Paper Structure (56 sections, 5 theorems, 45 equations, 3 figures, 12 tables)

Key Result

theorem 1

mAVE is performance-lossless under chosen watermark tests. That is, for any polynomial-time tester $\mathcal{A}$ and key $K_{sess} \leftarrow \text{KeyGen}(1^\rho)$, $\left|\Pr[\mathcal{A}(\mathbf{z}^s)=1] - \Pr[\mathcal{A}(\mathbf{z})=1]\right| < \text{negl}(\rho)$, where $\mathbf{z}^s$ is the watermarked latent and $\mathbf{z} \sim \mathcal{N}(0,I)$.
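The intuition behind this statement can be checked numerically on a toy scale. The sketch below is not the paper's construction: it assumes a hypothetical pseudorandom generator (`keyed_bits`, SHA-256 in counter mode) and a one-bit-per-dimension encoding in which each bit selects one half of the Gaussian CDF range before Inverse Transform Sampling. If the bits are indistinguishable from uniform, the resulting latent should look like $\mathcal{N}(0,I)$. The check covers only the first two moments, which is far weaker than the computational guarantee in the theorem.

```python
import hashlib
import random
from statistics import NormalDist, fmean, pvariance

STD_NORMAL = NormalDist(0.0, 1.0)

def keyed_bits(key, n):
    """Hypothetical PRG (an assumption): SHA-256 in counter mode, expanded to n bits."""
    bits, counter = [], 0
    while len(bits) < n:
        block = hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        bits.extend((byte >> shift) & 1 for byte in block for shift in range(8))
        counter += 1
    return bits[:n]

def bits_to_latent(bits, rng):
    """Inverse transform sampling: bit b picks the CDF interval [b/2, (b+1)/2)."""
    latent = []
    for b in bits:
        u = (b + rng.random()) / 2.0
        u = min(max(u, 1e-12), 1.0 - 1e-12)  # inv_cdf needs 0 < u < 1
        latent.append(STD_NORMAL.inv_cdf(u))
    return latent

def latent_to_bits(latent):
    """Recover each bit from the sign of its latent coordinate."""
    return [1 if z >= 0.0 else 0 for z in latent]

rng = random.Random(0)
bits = keyed_bits(b"session-key", 20_000)
latent = bits_to_latent(bits, rng)
print("mean:", round(fmean(latent), 3), "variance:", round(pvariance(latent), 3))
print("bits recovered:", latent_to_bits(latent) == bits)
```

The bits are recoverable from the signs of the latent, yet each coordinate is still marginally Gaussian as long as the bit stream looks uniform. That is the property the theorem formalizes against any polynomial-time tester.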

Figures (3)

  • Figure 1: Secure mAVE Entanglement. (Left) Standard joint audio-visual generation models rely on decoupled watermarks, leading to an Authentication Bypass. A Swap Attack replaces authentic audio ($z_a$) with a malicious audio track. Since both may possess valid independent watermarks, a decoupled detector is Fooled and incorrectly flags the manipulated content as authentic. (Right) Our mAVE framework enforces a Cryptographic Binding at initialization. By securely entangling the audio latent with the video latent ($z_a=f(z_v)$), we construct a formal Entanglement Manifold. Any swapped audio breaks this functional dependency, enabling our joint detector to confidently Intercept the attack.
  • Figure 2: Overview of mAVE. Our training-free method restricts the joint generation process to a cryptographically entangled manifold. Left: We first construct discrete grids where the audio grid ($B_a$) is cryptographically bound to the video grid ($B_v$) via a hash digest. Middle: These grids are randomized and projected into continuous latent space to form the initial noise latents ($\mathbf{z}_v, \mathbf{z}_a$), mathematically defining the Authentic Manifold $\mathcal{M}$. Right: A joint audio-visual generation model denoises these entangled latents in a unified forward pass, intrinsically preventing decoupling and Swap Attacks.
  • Figure 3: Security Analysis. (a) Weak Baseline Failure: Uncoupled watermarks exhibit complete distributional overlap. (b) Strong Baseline Limitation: Adding SyncNet improves separation but still allows significant overlap on ambiguous samples. (c) mAVE Separation: Our method cryptographically enforces binding, creating definitive separation between Authentic and Swapped pairs. (d) Forensic ROC: mAVE maintains high TPR at low FPR, while heuristic baselines perform poorly for high-security applications.
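
The binding pipeline described in Figures 1 and 2 can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. The helper names (`expand_bits`, `embed`, `binding_score`), the use of HMAC-SHA256 as the hash digest, the XOR masks used for randomization, and the one-bit-per-dimension sign encoding are all hypothetical choices. The sketch also skips the generation model entirely and reads bits directly from the initial latents, whereas a real detector would first have to recover those latents from the generated media.

```python
import hashlib
import hmac
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def expand_bits(key, label, n):
    """HMAC-SHA256 in counter mode, expanded to n bits (illustrative choice)."""
    out, counter = [], 0
    while len(out) < n:
        msg = label + counter.to_bytes(4, "big")
        block = hmac.new(key, msg, hashlib.sha256).digest()
        out.extend((byte >> i) & 1 for byte in block for i in range(8))
        counter += 1
    return out[:n]

def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]

def sample_latent(bits, rng):
    """Inverse transform sampling: each bit selects one half of the Gaussian."""
    latent = []
    for b in bits:
        u = min(max((b + rng.random()) / 2.0, 1e-12), 1.0 - 1e-12)
        latent.append(STD_NORMAL.inv_cdf(u))
    return latent

def read_bits(latent):
    return [1 if z >= 0.0 else 0 for z in latent]

def embed(key, n_video, n_audio, rng):
    """Build B_v, derive B_a from a keyed digest of B_v, then sample z_v and z_a."""
    video_grid = [rng.getrandbits(1) for _ in range(n_video)]
    audio_grid = expand_bits(key, b"bind" + bytes(video_grid), n_audio)
    z_v = sample_latent(xor(video_grid, expand_bits(key, b"mask-v", n_video)), rng)
    z_a = sample_latent(xor(audio_grid, expand_bits(key, b"mask-a", n_audio)), rng)
    return z_v, z_a

def binding_score(key, z_v, z_a):
    """Fraction of audio bits that match the digest recomputed from the video bits."""
    video_grid = xor(read_bits(z_v), expand_bits(key, b"mask-v", len(z_v)))
    audio_grid = xor(read_bits(z_a), expand_bits(key, b"mask-a", len(z_a)))
    expected = expand_bits(key, b"bind" + bytes(video_grid), len(z_a))
    return sum(a == e for a, e in zip(audio_grid, expected)) / len(z_a)

rng = random.Random(7)
key = b"vendor-session-key"
z_v1, z_a1 = embed(key, 512, 256, rng)   # first generation
z_v2, z_a2 = embed(key, 512, 256, rng)   # second, unrelated generation
print("authentic pair:", binding_score(key, z_v1, z_a1))
print("swapped pair:  ", binding_score(key, z_v1, z_a2))
```

In this idealized setting the authentic pair matches on every bit, while a swapped audio latent matches the recomputed digest only at chance level. If mismatched grids are modeled as independent uniform bits, the probability that all $n$ audio bits match by accident is $2^{-n}$, which illustrates why a bound of this kind decays exponentially in the grid size.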

Theorems & Definitions (10)

  • theorem 1
  • proof
  • theorem 2
  • proof
  • lemma 1: Distribution Preservation
  • proof
  • lemma 2: Independence under Mismatched Sessions
  • proof
  • lemma 3: Encrypted Optimization Objective
  • proof