Towards Multi-domain Face Landmark Detection with Synthetic Data from Diffusion model

Yuanming Li, Gwantae Kim, Jeong-gi Kwak, Bon-hwa Ku, Hanseok Ko

TL;DR

The paper tackles the problem of multi-domain facial landmark detection under limited annotated data. It introduces a two-stage diffusion-based data synthesis framework that learns landmark-aligned face generation and domain control via text prompts, producing about 10k synthetic samples across 25 styles. A pre-trained landmark detector is then fine-tuned on this synthetic data, achieving state-of-the-art results on ArtFace and competitive performance on CariFace, demonstrating data-efficient cross-domain capabilities. This approach enables robust landmark localization across artistic domains and real-world faces, with potential applications in AR/VR and digital avatars.

Abstract

Deep learning-based facial landmark detection for in-the-wild faces has improved significantly in recent years. However, landmark detection in other domains (e.g., cartoon and caricature) remains challenging because extensively annotated training data is scarce. To address this, we design a two-stage training approach that leverages limited datasets and a pre-trained diffusion model to obtain aligned landmark-face pairs in multiple domains. In the first stage, we train a landmark-conditioned face generation model on a large dataset of real faces. In the second stage, we fine-tune this model on a small dataset of image-landmark pairs, using text prompts to control the domain. These designs enable our method to generate high-quality synthetic paired datasets from multiple domains while preserving the alignment between landmarks and facial features. Finally, we fine-tune a pre-trained face landmark detection model on the synthetic dataset to achieve multi-domain face landmark detection. Our qualitative and quantitative results demonstrate that our method outperforms existing methods on multi-domain face landmark detection.
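
As a rough illustration of the data-synthesis step described in the abstract, the sketch below pairs each landmark set with each domain prompt and keeps the landmarks as the label for the image to be generated. It is not the authors' implementation: the names `GenerationRequest`, `rasterize_landmarks`, and `build_requests`, the single-pixel landmark map, and the prompt template are all assumptions made for illustration, and the diffusion model itself is left out.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple

# All names below are hypothetical; none are taken from the paper.
Landmarks = List[Tuple[float, float]]  # normalized (x, y) coordinates in [0, 1]


@dataclass
class GenerationRequest:
    condition: List[List[int]]  # binary landmark map of shape size x size
    prompt: str                 # text prompt that selects the target domain
    landmarks: Landmarks        # kept as the label of the synthetic image


def rasterize_landmarks(landmarks: Landmarks, size: int = 64) -> List[List[int]]:
    """Mark each normalized landmark as one 'on' pixel in a size x size grid."""
    grid = [[0] * size for _ in range(size)]
    for x, y in landmarks:
        col = min(size - 1, max(0, int(x * size)))
        row = min(size - 1, max(0, int(y * size)))
        grid[row][col] = 1
    return grid


def build_requests(landmark_sets: List[Landmarks], styles: List[str],
                   size: int = 64) -> List[GenerationRequest]:
    """Create one (condition, prompt, label) request per landmark set and style."""
    requests = []
    for landmarks, style in product(landmark_sets, styles):
        requests.append(GenerationRequest(
            condition=rasterize_landmarks(landmarks, size),
            prompt=f"a face in {style} style",  # assumed prompt template
            landmarks=landmarks,
        ))
    return requests


if __name__ == "__main__":
    # Three toy landmarks (two eyes, one mouth point) and two example domains.
    toy_landmarks = [[(0.25, 0.5), (0.75, 0.5), (0.5, 0.75)]]
    for request in build_requests(toy_landmarks, ["cartoon", "caricature"], size=8):
        print(request.prompt, sum(map(sum, request.condition)))
```

Because the landmarks that condition the generator are stored with each request, every synthetic image would come with its annotation for free, which is what makes fine-tuning a detector on the generated set possible.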

Paper Structure (10 sections, 4 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Overview of our proposed framework. We propose a two-stage framework for generating high-quality multi-domain faces from face landmarks with a diffusion prior. (a) In the first stage, we train our model on pairs of real faces and landmarks. (b) In the second stage, we further fine-tune the model on a small multi-domain dataset. (c) We generate diverse facial data by controlling both the text and the landmarks. (d) We fine-tune the pre-trained landmark detector on the synthetic dataset.
  • Figure 2: Sample results of our method.
  • Figure 3: Qualitative comparison between evaluated methods.
  • Figure 4: Ablation study of our method.