StyleYourSmile: Cross-Domain Face Retargeting Without Paired Multi-Style Data

Avirup Dey, Vinay Namboodiri

TL;DR

StyleYourSmile tackles one-shot cross-domain face retargeting without curated multi-style data by coupling a lightweight domain-style augmentation with a dual-encoder setup that disentangles identity from domain cues. A diffusion model is conditioned on identity tokens and on domain-style tokens, the latter routed through ControlNet together with spatial guidance from a 3DMM, so that expressions can be retargeted across domains while fine-grained identity features and stylistic attributes are preserved. The approach reports better identity retention and style fidelity than prior models on unseen domains, with ablations examining the contribution of style augmentation, ControlNet routing, and LoRA-based fine-tuning.

Abstract

Cross-domain face retargeting requires disentangled control over identity, expressions, and domain-specific stylistic attributes. Existing methods, typically trained on real-world faces, either fail to generalize across domains, require test-time optimization, or require fine-tuning on carefully curated multi-style datasets to achieve domain-invariant identity representations. In this work, we introduce StyleYourSmile, a novel one-shot cross-domain face retargeting method that eliminates the need for curated multi-style paired data. We propose an efficient data augmentation strategy alongside a dual-encoder framework that extracts domain-invariant identity cues and captures domain-specific stylistic variations. Leveraging these disentangled control signals, we condition a diffusion model to retarget facial expressions across domains. Extensive experiments demonstrate that StyleYourSmile achieves superior identity preservation and retargeting fidelity across a wide range of visual domains.

Paper Structure

This paper contains 27 sections, 5 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Arc2Face has strong identity retention, but it cannot preserve the source style because the underlying face recognition encoder discards all information that is not relevant to a person's identity.
  • Figure 2: Model overview. First, the source image $I_{src}$ is encoded in two ways: (i) a face recognition encoder $\mathcal{E}_{id}$ extracts domain-invariant identity features, which a decoder $\mathcal{P}_{id}$ projects into CLIP text space as identity tokens $c_{id}$; (ii) a style encoder $\mathcal{E}_{sty}$ extracts domain-specific style features, which a decoder $\mathcal{P}_{sty}$ projects into CLIP text space as style tokens $c_{sty}$. Simultaneously, a spatial conditioning image $I_{spt}$, a composite of 3DMM landmarks and foreground masks, is computed from the target image $I_{tgt}$. The denoising UNet, which contains trainable low-rank matrices, is then optimized to disentangle identity and domain style, conditioned on $c_{id}$ and a ControlNet signal that combines $I_{spt}$ and $c_{sty}$.
  • Figure 3: We augment the training data with different styles at varying degrees of abstraction. Training on such data incentivizes the model to decouple identity from image style.
  • Figure 4: Style injection method of [chung2024style]. First, the content and style images are inverted into latents $z^c_T$ and $z^s_T$, respectively. During inversion, the ($Q,K,V$) features of both are cached. To generate the styled image, we start from AdaIN($z^c_T$, $z^s_T$) and inject the key-value pairs ($K^s,V^s$) from the style image into the decoder layers, where they are matched with the corresponding queries $\tilde{Q}^{cs}$.
  • Figure 5: Visual comparison of various models on the stylized VoxCeleb1 [nagrani2020voxceleb] test set. Our model outperforms previous models in terms of identity retention and style preservation.
  • ...and 8 more figures
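
The two operations named in the Figure 4 caption, AdaIN on the inverted latents and attention with injected style keys and values, can be illustrated with a small toy sketch. This is not the authors' implementation: the function names `adain` and `attention`, the flat-list tensor layout, and the `eps` constant are assumptions made for illustration, and real pipelines would apply these steps to multi-dimensional latents inside the diffusion UNet.

```python
import math
from statistics import fmean, pstdev


def adain(content, style, eps=1e-5):
    """Per-channel AdaIN (illustrative layout: a list of channels, each a flat list).

    Each content channel is normalized with its own mean/std and then
    rescaled to the mean/std of the matching style channel.
    """
    out = []
    for c_ch, s_ch in zip(content, style):
        mu_c, sd_c = fmean(c_ch), pstdev(c_ch)
        mu_s, sd_s = fmean(s_ch), pstdev(s_ch)
        out.append([sd_s * (x - mu_c) / (sd_c + eps) + mu_s for x in c_ch])
    return out


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Passing the style image's cached keys and values here, with queries
    from the stylized branch, mimics the key-value injection in the caption.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return out


# Toy usage: one-channel "latents" standing in for z^c_T and z^s_T.
z_c = [[0.0, 1.0, 2.0, 3.0]]
z_s = [[10.0, 12.0, 14.0, 16.0]]
z_init = adain(z_c, z_s)  # starting latent carrying the style statistics

# Hypothetical cached style keys/values and stylized-branch queries.
q_cs = [[1.0, 0.0], [0.0, 1.0]]
k_s = [[1.0, 0.0], [0.0, 1.0]]
v_s = [[1.0, 2.0], [3.0, 4.0]]
styled = attention(q_cs, k_s, v_s)
```

In this sketch the AdaIN output keeps the spatial pattern of the content latent while taking on the style latent's per-channel statistics, and every attention output row is a weighted average of the style values, which is why injecting ($K^s,V^s$) pulls the generated features toward the style image.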