
Leveraging Avatar Fingerprinting: A Multi-Generator Photorealistic Talking-Head Public Database and Benchmark

Laura Pedrouzo-Rodriguez, Luis F. Gomez, Ruben Tolosana, Ruben Vera-Rodriguez, Roberto Daza, Aythami Morales, Julian Fierrez

Abstract

Recent advances in photorealistic avatar generation have enabled highly realistic talking-head avatars, raising security concerns regarding identity impersonation in AI-mediated communication. To address this challenging problem, the task of avatar fingerprinting aims to determine whether two avatar videos are driven by the same human operator. However, public databases for this task are scarce and based solely on outdated talking-head avatar generators, so they do not represent realistic scenarios for avatar fingerprinting. To overcome this limitation, the present article introduces AVAPrintDB, a new publicly available multi-generator talking-head avatar database for avatar fingerprinting. AVAPrintDB is constructed from two audiovisual corpora and three state-of-the-art avatar generators (GAGAvatar, LivePortrait, HunyuanPortrait), representing different synthesis paradigms, and includes both self- and cross-reenactments to simulate legitimate usage and impersonation scenarios. Building on this database, we also define a standardized and reproducible benchmark for avatar fingerprinting, considering public state-of-the-art avatar fingerprinting systems and exploring novel methods based on Foundation Models (DINOv2 and CLIP). In addition, we conduct a comprehensive analysis under generator and dataset shift. Our results show that, while identity-related motion cues persist across synthetic avatars, current avatar fingerprinting systems remain highly sensitive to changes in the synthesis pipeline and source domain. AVAPrintDB, the benchmark protocols, and the avatar fingerprinting systems are publicly available to facilitate reproducible research.



Figures (5)

  • Figure 1: Visualization of self-reenactment and cross-reenactment examples for each avatar generator and dataset. The left four columns show examples for a RAVDESS [livingstone2018ryerson] video, and the right four columns show examples for a CREMA-D [cao2014crema] video. The top row shows the frames from the two original videos used as driving videos for the avatars. $f$ indicates the frame number shown from the corresponding video.
  • Figure 2: Public avatar fingerprinting system based on a Graph Convolutional Network (GCN) model [pedrouzo2025really]. From an avatar video, facial landmarks are first extracted for each frame and, using them as input, the GCN produces a graph embedding per frame. Finally, a pooling block aggregates the individual graph embeddings into a single embedding for the video (a minimal code sketch of this pipeline is given after this figure list).
  • Figure 3: Proposed avatar fingerprinting system based on Foundation Models. The pretrained models (DINOv2 [oquab2024dinov] and CLIP [radford2021learning]) are frozen and only used to extract visual features for each frame. The resulting embeddings are then aggregated via a temporal attention block and a projection head (see the corresponding sketch after this figure list).
  • Figure 4: Graphical representation of the video-to-video scoring process for an enrollment-test pair of avatar videos. For each video, a set of fixed-length windows is computed. Since the lengths of the videos can differ, the numbers of windows for the two videos $X$ and $Y$ can also differ. The avatar fingerprinting system generates one embedding $z$ per window. The similarity $s_{x,y}$ between the embeddings of all window pairs is computed, and the final similarity score $S(V^{e},V^{t})$ is obtained by averaging all similarities (a scoring sketch is given after this figure list).
  • Figure 5: Intra-dataset intra-generator t-SNE embedding visualizations. All embeddings are obtained from CREMA-D test avatar videos generated with the LivePortrait [guo2024liveportrait] generator. Left: embeddings obtained with the Graph-based system. Right: embeddings obtained with the DINOv2 system. Colors indicate the driving identity and marker types indicate the target identity.
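
The following minimal sketch illustrates the graph-based pipeline described in Figure 2: per-frame landmark graphs are encoded by a small GCN and then pooled over time into one video embedding. It is written with PyTorch Geometric; the landmark dimensionality, layer sizes, and the use of mean pooling are illustrative assumptions, not the exact configuration of [pedrouzo2025really].

```python
# Sketch of the graph-based fingerprinting pipeline (Figure 2).
# Landmark extraction, graph topology, and layer sizes are assumptions.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class FrameGCN(nn.Module):
    """Maps one frame's facial-landmark graph to a fixed-size graph embedding."""
    def __init__(self, in_dim=2, hidden_dim=64, embed_dim=128):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, embed_dim)

    def forward(self, x, edge_index, batch):
        # x: (num_landmarks, in_dim) 2D landmark coordinates for one frame
        # edge_index: (2, num_edges) fixed connectivity between landmarks
        # batch: (num_landmarks,) graph-membership vector (all zeros for one graph)
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index)
        return global_mean_pool(h, batch)        # one embedding per frame-graph

class GraphFingerprinter(nn.Module):
    """Per-frame GCN followed by temporal pooling into a single video embedding."""
    def __init__(self):
        super().__init__()
        self.frame_gcn = FrameGCN()

    def forward(self, frame_graphs):
        # frame_graphs: list of (x, edge_index, batch) tuples, one per video frame
        frame_embeds = torch.stack(
            [self.frame_gcn(x, ei, b).squeeze(0) for x, ei, b in frame_graphs]
        )                                        # (num_frames, embed_dim)
        return frame_embeds.mean(dim=0)          # pooled video-level embedding
```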
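Similarly, the Foundation Model branch of Figure 3 can be sketched as follows: a frozen backbone (e.g., DINOv2) extracts per-frame features, which are aggregated by a temporal attention block and passed through a projection head. The attention formulation, feature dimension, and projection size below are assumptions for illustration, not the paper's exact hyperparameters.

```python
# Sketch of the Foundation-Model fingerprinting branch (Figure 3).
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Aggregates a sequence of frame features into one vector via learned attention."""
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, feats):                     # feats: (num_frames, feat_dim)
        weights = torch.softmax(self.score(feats), dim=0)   # (num_frames, 1)
        return (weights * feats).sum(dim=0)                 # (feat_dim,)

class FoundationFingerprinter(nn.Module):
    def __init__(self, backbone, feat_dim=768, embed_dim=128):
        super().__init__()
        self.backbone = backbone.eval()           # frozen feature extractor
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.pool = TemporalAttentionPool(feat_dim)
        self.head = nn.Linear(feat_dim, embed_dim)   # projection head

    def forward(self, frames):                    # frames: (num_frames, 3, H, W)
        with torch.no_grad():
            feats = self.backbone(frames)         # (num_frames, feat_dim)
        return self.head(self.pool(feats))        # one embedding per video window

# Example backbone via torch.hub (assumes cached DINOv2 weights or internet access):
# backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
# model = FoundationFingerprinter(backbone, feat_dim=768)
```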
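Finally, the video-to-video scoring of Figure 4 reduces to averaging pairwise similarities between window-level embeddings. The sketch below assumes non-overlapping windows and cosine similarity; the actual window length and similarity measure used in the benchmark may differ.

```python
# Sketch of the video-to-video scoring process (Figure 4).
import torch
import torch.nn.functional as F

def window_embeddings(frames, model, win_len=32):
    """Split a video into non-overlapping fixed-length windows and embed each one."""
    embeds = []
    for start in range(0, frames.shape[0] - win_len + 1, win_len):
        window = frames[start:start + win_len]    # (win_len, 3, H, W)
        embeds.append(model(window))              # one embedding z per window
    return torch.stack(embeds)                    # (num_windows, embed_dim)

def video_similarity(enroll_frames, test_frames, model):
    """Score S(V^e, V^t): mean pairwise similarity between window embeddings."""
    z_e = F.normalize(window_embeddings(enroll_frames, model), dim=-1)
    z_t = F.normalize(window_embeddings(test_frames, model), dim=-1)
    sim_matrix = z_e @ z_t.T                      # s_{x,y} for all window pairs
    return sim_matrix.mean().item()
```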