Unmasking Puppeteers: Leveraging Biometric Leakage to Disarm Impersonation in AI-based Videoconferencing

Danial Samadi Vahdati, Tai Duc Nguyen, Ekta Prashnani, Koki Nagano, David Luebke, Orazio Gallo, Matthew Stamm

TL;DR

This work tackles puppeteering in AI-based talking-head videoconferencing by exploiting biometric leakage in pose-expression latents. It introduces an Enhanced Biometric Leakage (EBL) space, learned with a pose-conditioned large-margin cosine loss (PC-LMCL), and uses a temporal LSTM to fuse evidence for real-time detection, all without RGB reconstruction or enrollment. Across fifteen generator/dataset combinations, the method achieves state-of-the-art detection (AUC > 0.97 on combined data; ~0.925 in cross-domain settings) and generalizes well to unseen domains while maintaining real-time performance on consumer-grade GPUs. The approach provides a practical, enrollment-free safeguard that strengthens trust in bandwidth-efficient videoconferencing by comparing driving and target identities entirely in latent space.

Abstract

AI-based talking-head videoconferencing systems reduce bandwidth by sending a compact pose-expression latent and re-synthesizing RGB at the receiver, but this latent can be puppeteered, letting an attacker hijack a victim's likeness in real time. Because every frame is synthetic, deepfake and synthetic video detectors fail outright. To address this security problem, we exploit a key observation: the pose-expression latent inherently contains biometric information of the driving identity. Therefore, we introduce the first biometric leakage defense that never looks at the reconstructed RGB video: a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered. Our experiments on multiple talking-head generation models show that our method consistently outperforms existing puppeteering defenses, operates in real time, and generalizes strongly to out-of-distribution scenarios.
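The cosine test mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the threshold of 0.5, and the toy 3-dimensional vectors standing in for identity embeddings are all assumptions made for illustration, and the per-frame scores are fused with a plain mean rather than a learned temporal model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def flag_puppeteering(driving_embeddings, reference_embedding, threshold=0.5):
    """Flag a window of frames as puppeteered when the mean similarity
    between driving and reference embeddings falls below `threshold`.
    The threshold value is a hypothetical choice, not taken from the paper."""
    scores = [cosine_similarity(e, reference_embedding) for e in driving_embeddings]
    mean_score = float(np.mean(scores))
    return mean_score < threshold, mean_score

# Toy embeddings (assumed values): the reference identity points along axis 0.
reference = np.array([1.0, 0.0, 0.0])
same_speaker = [np.array([0.9, 0.1, 0.0]), np.array([0.8, 0.0, 0.2])]
other_speaker = [np.array([0.0, 1.0, 0.1]), np.array([0.1, 0.9, 0.0])]

flagged_same, score_same = flag_puppeteering(same_speaker, reference)
flagged_other, score_other = flag_puppeteering(other_speaker, reference)
print(flagged_same, flagged_other)
```

In this toy setup the matching speaker stays well above the threshold while the mismatched speaker falls below it; a deployed system would calibrate the threshold on held-out data.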

Paper Structure

This paper contains 29 sections, 1 theorem, 16 equations, 7 figures, 7 tables.

Key Result

Proposition 1

Assume all embeddings are $\ell_2$-normalized. If, for some $\epsilon,\gamma>0$, [...], then the class centers $\mu_k = \mathbb{E}_{p}[R^{k,p}]$ satisfy [...]. Thus $\mathcal{L}_B$ enforces an inter-class angular margin of at least $\epsilon+\gamma$ within each pose slice.
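As a rough illustration of the quantities named in the proposition, the sketch below computes class centers $\mu_k$ as the mean over poses of $\ell_2$-normalized embeddings and then measures the cosine between pairs of centers. It does not reproduce the proposition's conditions or bound; the array layout (identities × poses × dimensions), the helper names, and the synthetic data are assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors along `axis` to unit Euclidean norm."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def class_centers(embeddings):
    """embeddings has shape (K identities, P poses, D dims).
    Returns mu_k, the mean over poses of the normalized embeddings."""
    return l2_normalize(embeddings, axis=-1).mean(axis=1)

def center_cosines(centers):
    """Pairwise cosine similarity between class centers.
    Centers are re-normalized because a mean of unit vectors
    generally has norm below 1."""
    unit = l2_normalize(centers, axis=-1)
    return unit @ unit.T

# Synthetic data (assumed): 3 identities, 4 poses, 3 dims, each identity
# clustered around its own axis with small pose-dependent noise.
rng = np.random.default_rng(0)
base = np.eye(3)
R = base[:, None, :] + 0.05 * rng.standard_normal((3, 4, 3))

mu = class_centers(R)
cos = center_cosines(mu)
off_diagonal = cos[~np.eye(3, dtype=bool)]
print(np.round(cos, 3))
```

A small off-diagonal cosine corresponds to a large angle between class centers, which is the kind of separation an inter-class angular margin is meant to guarantee.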

Figures (7)

  • Figure 1: AI-based talking-head generators transmit only a compact pose-and-expression embedding for low-bandwidth videoconferencing, but remain vulnerable to puppeteering attacks that swap in a different identity for live impersonation. Our defense capitalizes on biometric signals inadvertently leaked in these embeddings to reveal mismatches between the driving speaker and the reconstructed identity in real time.
  • Figure 2: Illustration of three datasets (NVIDIA-VC [prashnani2024avatar], RAVDESS [livingstone2018ryerson], CREMA-D [cao2014crema]), each shown across three columns. Row 2 indicates the type of frame displayed in Row 3: Reference, Self-Reenacted, or Cross-Reenacted.
  • Figure 3: Similarity distributions in P&E space (left) and biometric leakage space (right). Red: same ID, diff. P&E; blue: diff. ID, same P&E; black: diff. ID, diff. P&E.
  • Figure 3: Statistics of the datasets and generated data used in this paper.
  • Figure 4: Detection AUC vs. window size and number of puppeteered identities during training.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Proposition 1