mAVE: A Watermark for Joint Audio-Visual Generation Models

Luyang Si; Leyi Pan; Lijie Wen

Abstract

As Joint Audio-Visual Generation Models see widespread commercial deployment, embedding watermarks has become essential for protecting vendor copyright and ensuring content provenance. However, existing techniques treat the modalities as decoupled entities, an architectural mismatch that exposes a critical Binding Vulnerability. Adversaries exploit it through Swap Attacks, replacing authentic audio with malicious deepfakes while retaining the watermarked video. Because current detectors rely on independent verification ($\text{Video}_{wm} \vee \text{Audio}_{wm}$), they incorrectly authenticate the manipulated content, falsely attributing harmful media to the original vendor and severely damaging its reputation. To address this, we propose mAVE (Manifold Audio-Visual Entanglement), the first watermarking framework natively designed for joint architectures. mAVE cryptographically binds audio and video latents at initialization without fine-tuning, defining a Legitimate Entanglement Manifold via Inverse Transform Sampling. Experiments on state-of-the-art models (LTX-2, MOVA) demonstrate that mAVE guarantees performance-losslessness and provides an exponential security bound against Swap Attacks. Achieving near-perfect binding integrity ($>99\%$), mAVE offers a robust cryptographic defense for vendor copyright.
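The gap between independent verification and a bound check can be illustrated with a toy model. The sketch below is not the paper's detector: the `Track` type, the HMAC-based binding, and the key are hypothetical stand-ins chosen only to show why an OR over per-modality checks accepts a swapped pair while a joint check rejects it.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Track:
    """One modality; `payload` is the recovered watermark, or None if absent."""
    payload: Optional[bytes]


def bound_audio_payload(key: bytes, video_payload: bytes) -> bytes:
    # Hypothetical binding: the audio payload is a keyed digest of the video payload.
    return hmac.new(key, video_payload, hashlib.sha256).digest()


def decoupled_detect(video: Track, audio: Track) -> bool:
    # Independent verification: Video_wm OR Audio_wm.
    return video.payload is not None or audio.payload is not None


def joint_detect(key: bytes, video: Track, audio: Track) -> bool:
    # Joint verification: the audio payload must be the one derived from the video.
    if video.payload is None or audio.payload is None:
        return False
    expected = bound_audio_payload(key, video.payload)
    return hmac.compare_digest(audio.payload, expected)


KEY = b"hypothetical-vendor-key"
video = Track(payload=b"video-watermark-bits")
authentic_audio = Track(payload=bound_audio_payload(KEY, video.payload))
swapped_audio = Track(payload=None)  # deepfake audio carrying no vendor watermark

print(decoupled_detect(video, swapped_audio))    # True: the swap is authenticated
print(joint_detect(KEY, video, swapped_audio))   # False: the swap is rejected
print(joint_detect(KEY, video, authentic_audio)) # True: the genuine pair passes
```

In this toy setting the decoupled detector passes the swapped pair because the retained video alone satisfies the OR, which is the failure mode the abstract describes.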

Paper Structure (56 sections, 5 theorems, 45 equations, 3 figures, 12 tables)

Introduction
Related Work
From Single-Modality to Joint Audio-Visual Generation
Video Watermarking
Audio Watermarking and The Binding Gap
Threat Model and Problem Formulation
Method
Constructing the Authentic Manifold
Embed: Inverse Transform Sampling on Manifolds
Detect: Joint Inversion & Verification
Security Analysis
Experiments
Implementation Details
Extraction Performance
Fidelity: Verification of Losslessness
...and 41 more sections

Key Result

theorem 1

mAVE is performance-lossless under chosen watermark tests. That is, for any polynomial-time tester $\mathcal{A}$ and key $K_{sess} \leftarrow \text{KeyGen}(1^\rho)$, the advantage of $\mathcal{A}$ in distinguishing the watermarked latent $\mathbf{z}^s$ from $\mathbf{z} \sim \mathcal{N}(0,I)$ is negligible in $\rho$.
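The intuition behind this kind of losslessness claim can be checked numerically in one dimension. The following is a minimal sketch, assuming a one-bit-per-coordinate scheme in which a uniformly distributed bit selects one half of the unit interval and the result is pushed through the inverse Gaussian CDF; the function names are hypothetical, and the seeded `random.Random` stands in for the keyed pseudorandom bits that a session key would supply.

```python
import random
from statistics import NormalDist, fmean, pstdev

STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def embed_bit(bit: int, rng: random.Random) -> float:
    """Inverse transform sampling: bit 0 maps to z < 0, bit 1 maps to z >= 0."""
    u = (bit + rng.random()) / 2.0          # uniform on [bit/2, (bit+1)/2)
    u = min(max(u, 1e-12), 1.0 - 1e-12)     # keep inv_cdf away from 0 and 1
    return STD_NORMAL.inv_cdf(u)


def recover_bit(z: float) -> int:
    """The sign of the latent coordinate reveals the embedded bit."""
    return 1 if z >= 0.0 else 0


rng = random.Random(0)
# Stand-in for encrypted watermark bits, which should look uniformly random.
bits = [rng.getrandbits(1) for _ in range(20000)]
latents = [embed_bit(b, rng) for b in bits]
recovered = [recover_bit(z) for z in latents]

sample_mean = fmean(latents)
sample_std = pstdev(latents)
print(recovered == bits)        # True: every bit is recoverable from the sign
print(sample_mean, sample_std)  # close to 0 and 1 respectively
```

Because the bits are uniform, `u` is uniform on the whole unit interval, so the latents follow $\mathcal{N}(0,1)$ even though each one carries a bit. The sketch only illustrates distribution preservation and does not model the computational indistinguishability argument.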

Figures (3)

Figure 1: Secure mAVE Entanglement. (Left) Standard joint audio-visual generation models rely on decoupled watermarks, leading to an Authentication Bypass. A Swap Attack replaces authentic audio ($z_a$) with a malicious audio track. Since both may possess valid independent watermarks, a decoupled detector is Fooled and incorrectly flags the manipulated content as authentic. (Right) Our mAVE framework enforces a Cryptographic Binding at initialization. By securely entangling the audio latent to the video latent ($z_a=f(z_v)$), we construct a formal Entanglement Manifold. Any swapped audio breaks this functional dependency, enabling our joint detector to confidently Intercept the attack.
Figure 2: Overview of mAVE. Our training-free method restricts the joint generation process to a cryptographically entangled manifold. Left: We first construct discrete grids where the audio grid ($B_a$) is cryptographically bound to the video grid ($B_v$) via a hash digest. Middle: These grids are randomized and projected into continuous latent space to form the initial noise latents ($\mathbf{z}_v, \mathbf{z}_a$), mathematically defining the Authentic Manifold $\mathcal{M}$. Right: A joint audio-visual generation model denoises these entangled latents in a unified forward pass, intrinsically preventing decoupling and Swap Attacks.
Figure 3: Security Analysis. (a) Weak Baseline Failure: Uncoupled watermarks exhibit complete distributional overlap. (b) Strong Baseline Limitation: Adding SyncNet improves separation but still allows significant overlap on ambiguous samples. (c) mAVE Separation: Our method cryptographically enforces binding, creating definitive separation between Authentic and Swapped pairs. (d) Forensic ROC: mAVE maintains high TPR at low FPR, while heuristic baselines perform poorly for high-security applications.
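The grid binding in Figure 2 and the separation in Figure 3 can be sketched together. This is an illustrative assumption rather than the paper's construction: the audio grid is taken to be the 256 bits of an HMAC-SHA256 digest of the video grid, the key and grid contents are placeholders, and the false-accept estimate treats a mismatched audio grid as independent uniform bits.

```python
import hashlib
import hmac
from math import ceil, comb
from typing import List


def to_bits(data: bytes) -> List[int]:
    """Unpack bytes into a flat list of bits, least significant bit first."""
    return [(byte >> i) & 1 for byte in data for i in range(8)]


def derive_audio_grid(key: bytes, video_grid: bytes) -> List[int]:
    # Hypothetical binding: B_a is a keyed hash digest of B_v (256 bits).
    return to_bits(hmac.new(key, video_grid, hashlib.sha256).digest())


def agreement(a: List[int], b: List[int]) -> float:
    """Fraction of positions where two equal-length bit grids match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def false_accept_probability(n_bits: int, threshold: float) -> float:
    """P[at least threshold * n_bits of n_bits fair coin flips match]."""
    k_min = ceil(threshold * n_bits)
    return sum(comb(n_bits, k) for k in range(k_min, n_bits + 1)) / 2 ** n_bits


KEY = b"hypothetical-session-key"
video_grid = b"\x01" * 32        # placeholder for the recovered B_v
other_video_grid = b"\x02" * 32  # B_v from a different generation

expected_audio = derive_audio_grid(KEY, video_grid)
authentic_score = agreement(derive_audio_grid(KEY, video_grid), expected_audio)
swapped_score = agreement(derive_audio_grid(KEY, other_video_grid), expected_audio)

print(authentic_score)  # 1.0: the genuine pair matches exactly
print(swapped_score)    # near 0.5: a swapped grid agrees only by chance
print(false_accept_probability(256, 0.75))
```

Under the independence assumption, the chance that a swapped audio grid clears a fixed agreement threshold is a binomial tail that shrinks exponentially in the number of bound bits, which is the sense of an exponential security bound.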

Theorems & Definitions (10)

theorem 1
proof
theorem 2
proof
lemma 1: Distribution Preservation
proof
lemma 2: Independence under Mismatched Sessions
proof
lemma 3: Encrypted Optimization Objective
proof
