Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos

Ekta Prashnani, Koki Nagano, Shalini De Mello, David Luebke, Orazio Gallo

TL;DR

The paper defines avatar fingerprinting as verifying whether the identity driving a synthetic talking-head video is authorized, independent of the target appearance shown. It introduces a motion-based dynamic identity embedding, learned with a temporal CNN and a novel contrastive loss that pulls together videos driven by the same identity and pushes apart those driven by others; a time-shuffle term emphasizes temporal dynamics. The large-scale NVFAIR dataset, with real videos and synthetic self- and cross-reenactments from three generators, is released to support this task. The method reaches an average AUC of 0.85, and an average AUC of 0.83 on generators never seen in training, providing a foundation for the trustworthy use of synthetic avatars.

Abstract

Modern avatar generators allow anyone to synthesize photorealistic real-time talking avatars, ushering in a new era of avatar-based human communication, such as with immersive AR/VR interactions or videoconferencing with limited bandwidths. Their safe adoption, however, requires a mechanism to verify if the rendered avatar is trustworthy: does it use the appearance of an individual without their consent? We term this task avatar fingerprinting. To tackle it, we first introduce a large-scale dataset of real and synthetic videos of people interacting on a video call, where the synthetic videos are generated using the facial appearance of one person and the expressions of another. We verify the identity driving the expressions in a synthetic video, by learning motion signatures that are independent of the facial appearance shown. Our solution, the first in this space, achieves an average AUC of 0.85. Critical to its practical use, it also generalizes to new generators never seen in training (average AUC of 0.83). The proposed dataset and other resources can be found at: https://research.nvidia.com/labs/nxp/avatar-fingerprinting/.
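
To make the reported metric concrete, the following is a minimal sketch of how an AUC could be computed for a verification task of this kind. It assumes that each test pair of clips comes with an embedding distance and a label saying whether both clips share the same driving identity. The function name `auc_from_distances` and the toy values are hypothetical and are not taken from the paper or its released resources.

```python
def auc_from_distances(distances, same_driver):
    """Rank-based AUC for distance scores.

    Returns the fraction of (same-driver, different-driver) pair
    combinations in which the same-driver pair has the smaller
    distance. Ties count as half a win.
    """
    pos = [d for d, s in zip(distances, same_driver) if s]
    neg = [d for d, s in zip(distances, same_driver) if not s]
    if not pos or not neg:
        raise ValueError("need at least one pair of each kind")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration): three same-driver pairs,
# three different-driver pairs.
toy_distances = [0.2, 0.4, 0.9, 0.3, 0.8, 0.7]
toy_same_driver = [True, True, True, False, False, False]
toy_auc = auc_from_distances(toy_distances, toy_same_driver)
```

An AUC of 1.0 would mean every same-driver pair is closer than every different-driver pair, and 0.5 corresponds to chance.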

Paper Structure

This paper contains 46 sections, 6 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Talking-head avatar generators can synthesize realistic videos of a target identity from driving videos of different identities. Our method extracts appearance-agnostic temporal facial features and learns an embedding in which the synthetic videos driven by one identity fall close to each other and far from those driven by other identities, regardless of the appearance of the synthetic video. By comparing distances in the embedding space, we evaluate whether an avatar is driven by an authorized identity or not. During evaluation, we only rely on the synthetic videos as input, without requiring any prior knowledge about the driving identity.
  • Figure 2: We introduce the NVFAIR dataset, containing real and synthetic talking-head videos. We capture subjects talking in both scripted and free-form settings. To encourage natural performance, we record the subjects while videoconferencing with each other (left). We then synthesize more than $650,000$ talking-head videos, the largest collection to date, using three state-of-the-art face-reenactment talking-head generators. On the right, each row corresponds to a driving identity ($\text{ID}_i\rightarrow(\cdot)$) and each column corresponds to a different target identity ($(\cdot)\rightarrow\text{ID}_i$). The videos in which driving and target identity match are self-reenactments; the rest are cross-reenactments.
  • Figure 3: We extract landmarks from the frames of a talking-head clip, compute their normalized pairwise distances, and concatenate the frame-wise features. We then learn an identity embedding using a loss that pulls closer features of videos driven by the same identity and pushes away those driven by others. $\text{ID}_i\rightarrow\text{ID}_j$ indicates a video that looks like identity $j$ (the "target" identity), and is driven by identity $i$.
  • Figure 4 (animated in the original PDF): Our embeddings capture the dynamics of an expression, rather than the appearance of the face. For each row, we pick a reference identity. The green box indicates reenactments driven by the reference identity; the red and blue boxes indicate cross-reenactments that use the reference identity as target. We compute the average distance of each clip shown here against all other clips driven by the reference identity. This average distance is consistent for a given motion, and lower (better) when the reference identity is driving than for the cross-reenactments that use the reference identity as target. Here, we show videos generated by face-vid2vid [wang2021facevid2vid], and use the embedding vectors predicted by the model trained on the same generator (see Figure \ref{fig:new-generator-robust} for generalization to new generators not seen during training).
  • Figure 5: ROC curves and AUC values for our method and two baselines: Agarwal et al. [Agarwal_2019_CVPR_Workshops] and ID-Reveal [Cozzolino_2021_ICCV]. Each sub-plot shows the results on our test set for one of the three talking-head generators: face-vid2vid [wang2021facevid2vid], LIA [wang2022latent], and TPS [tps2022].
  • ...and 3 more figures
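
The feature extraction step described in the Figure 3 caption (landmarks, normalized pairwise distances, frame-wise concatenation) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the captions do not specify the normalization or the landmark detector, so the sketch takes landmarks as given 2D points, divides by the largest pairwise distance in each frame, and flattens the per-frame features into one list. The names `frame_features` and `clip_features` are hypothetical.

```python
import math
from itertools import combinations


def frame_features(landmarks):
    """Pairwise Euclidean distances between the 2D landmarks of one frame.

    Assumption: distances are normalized by the largest pairwise
    distance in the frame, so the result does not depend on face scale.
    """
    dists = [math.dist(a, b) for a, b in combinations(landmarks, 2)]
    scale = max(dists)
    if scale == 0:
        raise ValueError("all landmarks coincide")
    return [d / scale for d in dists]


def clip_features(clip):
    """Concatenate the frame-wise features of a clip in temporal order."""
    features = []
    for landmarks in clip:
        features.extend(frame_features(landmarks))
    return features


# Toy clip (made up): two frames with three landmarks each. The second
# frame is the first one scaled by a factor of 2.
toy_clip = [
    [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)],
    [(0.0, 0.0), (6.0, 0.0), (0.0, 8.0)],
]
toy_features = clip_features(toy_clip)
```

In the toy clip both frames yield the same normalized features, which shows why distance-based features of this kind depend on facial geometry and motion rather than on image scale. In the paper these features feed a learned embedding; that learning step is not sketched here.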