FreeUV: Ground-Truth-Free Realistic Facial UV Texture Recovery via Cross-Assembly Inference Strategy

Xingchao Yang, Takafumi Taketomi, Yuki Endo, Yoshihiro Kanamori

TL;DR

FreeUV generates high-quality 3D facial UV textures from a single 2D image without ground-truth UV data. It uses a dual-network architecture that separates appearance (in-the-wild realism) from structure (3DMM-based geometry) and fuses the two at inference through Cross-Assembly inside a pre-trained Stable Diffusion model, aided by CLIP and ControlNet conditioning. The approach improves texture fidelity over prior methods, is robust to occlusions and makeup, and enables practical applications such as local editing, feature interpolation, and multi-view texture recovery, all without annotated or synthetic UV data. By reducing data requirements, the framework offers a scalable route to realistic facial UV texture reconstruction in real-world graphics and vision applications.

Abstract

Recovering high-quality 3D facial textures from single-view 2D images is a challenging task, especially under constraints of limited data and complex facial details such as makeup, wrinkles, and occlusions. In this paper, we introduce FreeUV, a novel ground-truth-free UV texture recovery framework that eliminates the need for annotated or synthetic UV data. FreeUV leverages a pre-trained Stable Diffusion model alongside a Cross-Assembly inference strategy to fulfill this objective. In FreeUV, separate networks are trained independently to focus on realistic appearance and structural consistency, and these networks are combined during inference to generate coherent textures. Our approach accurately captures intricate facial features and demonstrates robust performance across diverse poses and occlusions. Extensive experiments validate FreeUV's effectiveness, with results surpassing state-of-the-art methods in both quantitative and qualitative evaluations. Additionally, FreeUV enables new applications, including local editing, facial feature interpolation, and multi-view texture recovery. By reducing data requirements, FreeUV offers a scalable solution for generating high-fidelity 3D facial textures suitable for real-world scenarios.

Paper Structure

This paper contains 20 sections, 2 equations, 19 figures, 3 tables.

Figures (19)

  • Figure 1: Examples of FreeUV results. Top to bottom: input face images, recovered UV textures, and FLAME model-based rendering. FreeUV generates a complete UV texture from a single face image without requiring ground-truth UV supervision during training. The method captures intricate details, such as facial hair, wrinkles, occlusions, and makeup, while demonstrating robustness across diverse scenarios, achieving high fidelity and coherent texture recovery.
  • Figure 2: Example of data and domain characteristics used in FreeUV. Realistic textures are derived from in-the-wild data, while structurally consistent textures are generated from parametric 3DMM data.
  • Figure 3: Overview of FreeUV Framework. FreeUV leverages two modules, the Flaw-Tolerant Detail Extractor ${\psi}_a$ (left) and the UV Structure Aligner ${\psi}_s$ (middle), to separately capture realistic appearance and structural consistency. Combined during the Cross-Assembly inference phase (right), these modules produce high-quality UV textures from single-view images, without requiring ground-truth UV data.
  • Figure 4: Comparison of 3D face reconstruction results. Our method achieves the closest match to the original input by rendering and overlaying the recovered UV texture. Even under challenging conditions, such as extreme lighting, facial hair, and occlusions, our approach preserves fine details and color consistency.
  • Figure 5: Comparison with makeup-focused reconstruction method. Our approach captures finer details, accurately preserving makeup features with greater clarity and consistency.
  • ...and 14 more figures