Adversarial Identity Injection for Semantic Face Image Synthesis

Giuseppe Tarollo, Tomaso Fontanini, Claudio Ferrari, Guido Borghi, Andrea Prati

TL;DR

This work tackles identity preservation in semantic face image synthesis by introducing a cross-attention-based architecture, inspired by CA^2SIS, that jointly leverages identity, style, and semantic cues. A pre-trained identity encoder supplies an identity embedding as an additional style input; combined with an identity preservation loss, this strengthens identity fidelity in generated faces and enables identity swapping at inference. The approach is evaluated on CelebAMask-HQ, where it improves identity preservation as measured by multiple face-recognition models and enables adversarial attacks that steer recognition toward a target identity; attention maps reveal which regions most influence identity. Style-transfer experiments further show that selectively swapping region styles can make the attack more effective while the result remains hard for humans to tell apart, which raises ethical concerns and underlines the need for safeguards in biometric systems.
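As a rough illustration of the identity preservation loss mentioned above, the sketch below scores a generated face by one minus the cosine similarity between its identity embedding and that of the source face. This summary does not state the exact formulation, so the cosine form, the function names `cosine_similarity` and `identity_loss`, and the toy embeddings are assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def identity_loss(source_emb: Sequence[float], generated_emb: Sequence[float]) -> float:
    """Assumed loss form: 0 when the embeddings point the same way, 2 when opposite."""
    return 1.0 - cosine_similarity(source_emb, generated_emb)


# Toy embeddings standing in for the output of a pre-trained identity encoder.
same = identity_loss([3.0, 4.0], [3.0, 4.0])
orthogonal = identity_loss([1.0, 0.0], [0.0, 1.0])
opposite = identity_loss([1.0, 0.0], [-1.0, 0.0])
```

Under this assumed form, swapping identities at inference would amount to feeding the generator an embedding from a different face, while the loss is only used during training to pull the generated face toward the source embedding.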

Abstract

Deep learning models now achieve remarkable performance in image generation. A large body of work addresses face generation and editing, to the point that both humans and automatic systems struggle to distinguish real images from generated ones. Although most systems reach excellent visual quality, they still have difficulty preserving the identity of the input subject. Among the techniques explored so far, Semantic Image Synthesis (SIS) methods, whose goal is to generate an image conditioned on a semantic segmentation mask, are the most promising, even though preserving the perceived identity of the input subject is not their main concern. Therefore, in this paper, we investigate the problem of identity preservation in face image generation and present an SIS architecture that exploits a cross-attention mechanism to merge identity, style, and semantic features, generating faces whose identities are as similar as possible to the input ones. Experimental results show that the proposed method is not only suitable for preserving identity but is also effective as an adversarial attack on face recognition, i.e. hiding a second identity in the generated faces.
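The cross-attention mechanism named in the abstract can be pictured as standard scaled dot-product attention, in which each spatial feature (a query) attends over a small set of style and identity tokens (keys and values). The sketch below is a generic single-head version without learned projections, not the paper's layer; the function names and the toy vectors are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> Tuple[List[Vector], List[Vector]]:
    """Scaled dot-product attention of each query over all key/value tokens.

    Returns the attended outputs and the per-query attention weights.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# One spatial feature attending over two tokens (e.g. one style, one identity).
out, attn = cross_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[2.0, 0.0], [4.0, 0.0]],
)
```

The returned weights are the kind of quantity an attention-map visualization would display: for each spatial position, how strongly it draws on each token.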

Paper Structure

This paper contains 12 sections, 5 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Overview of the proposed architecture. Starting from a face image, style and identity features are extracted through the encoders (Style Encoder ($\mathcal{E}_{s}$), Identity Encoder ($\mathcal{E}_{id}$) and Mask Embedder ($\mathcal{E}_{m}$)) and used by the Generator ($\mathcal{G}$), together with the semantic segmentation mask to generate the final image.
  • Figure 2: Qualitative comparison between the original $CA^2SIS$ model [fontanini2023semantic] and the proposed architecture with cross-attention-based identity injection (see Sect. \ref{subsec:id_preservation}). Identity-related details such as the color of the eyes, the eyebrows, and the mouth shape, as well as subtle details such as the teeth, are better preserved when conditioning on the identity embedding. Better seen on screen.
  • Figure 3: Results of our architecture obtained as described in Sect. \ref{sec:attack}. The first column is the attacker, the second column is the target, the third column is the reconstruction of the attacker using the correct identity embedding, and the fourth column is the reconstruction obtained when injecting the identity of the target into the attacker. Finally, the last column shows the pixel difference between the two reconstructions, highlighting that the identity information is effectively concealed in the manipulated face.
  • Figure 4: Cross-attention layer visualization when injecting the identity of a target face into the attacker. The areas most affected by the identity injection are the eyes, eyebrows, nose, and mouth. This result suggests that perceived identity is complex information carried by several different facial traits.
  • Figure 5: Attack Success Rate (ASR) and LPIPS values obtained when swapping different styles along with the identity. The style swapping procedure is described in Sect. \ref{sec:swap_styles}.
  • ...and 1 more figure
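Figure 5 reports an Attack Success Rate (ASR). One common way to compute such a rate, assumed here rather than taken from the paper, is the fraction of manipulated faces whose similarity to the target identity, as scored by a face-recognition model, reaches a decision threshold. The function name, the threshold, and the scores below are illustrative only.

```python
from typing import Sequence


def attack_success_rate(target_similarities: Sequence[float], threshold: float) -> float:
    """Fraction of attack images matched to the target identity.

    target_similarities: one similarity score per manipulated face, measured
    against the target identity (higher means more similar).
    threshold: assumed decision threshold of the recognition system.
    """
    if not target_similarities:
        raise ValueError("at least one similarity score is required")
    successes = sum(1 for s in target_similarities if s >= threshold)
    return successes / len(target_similarities)


# Made-up scores for four manipulated faces; three reach the 0.5 threshold.
asr = attack_success_rate([0.9, 0.2, 0.5, 0.7], threshold=0.5)
```

In practice the threshold would come from the operating point of the recognition model under attack, so ASR values are only comparable when the same model and threshold are used.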