Depth Map Denoising Network and Lightweight Fusion Network for Enhanced 3D Face Recognition

Ruizhuo Xu; Ke Wang; Chao Deng; Mei Wang; Xi Chen; Wenhui Huang; Junlan Feng; Weihong Deng

TL;DR

This work tackles the challenge of noisy, low-quality depth maps in 3D face recognition (FR) by introducing DMDNet, a depth map denoising network based on the Denoising Implicit Image Function (DIIF). DMDNet uses coordinate-aware latent codes and multi-scale decoding to produce cleaner depth faces while preserving identity. It is complemented by LDNFNet, a lightweight recognition network that fuses the depth and normal modalities through a multi-branch fusion block. Across four low-quality databases, DMDNet improves denoising quality and, combined with LDNFNet, achieves state-of-the-art results on the Lock3DFace database, with generalization demonstrated across different depth sensors. The approach advances 3D FR by combining implicit representations for denoising with a compact fusion mechanism for multimodal cues, which is relevant to affordable, robust 3D face recognition applications.

Abstract

With the increasing availability of consumer depth sensors, 3D face recognition (FR) has attracted growing attention. However, the data acquired by these sensors are often coarse and noisy, making them impractical to use directly. In this paper, we introduce an innovative Depth Map Denoising Network (DMDNet) based on the Denoising Implicit Image Function (DIIF) to reduce noise and enhance the quality of facial depth images for low-quality 3D FR. After generating clean depth faces with DMDNet, we further design a powerful recognition network called the Lightweight Depth and Normal Fusion Network (LDNFNet), which incorporates a multi-branch fusion block to learn unique and complementary features across modalities such as depth and normal images. Comprehensive experiments conducted on four distinct low-quality databases demonstrate the effectiveness and robustness of our proposed methods. Furthermore, when combining DMDNet and LDNFNet, we achieve state-of-the-art results on the Lock3DFace database.

Paper Structure (23 sections, 7 equations, 8 figures, 9 tables)

Introduction
Related Works
Depth Map Denoising
Low-quality 3D Face Recognition
Implicit Neural Representations
Depth Map Denoising Network
Denoising Implicit Image Function
Convolutional Encoder
Decoding Function
Loss Function
Lightweight Depth and Normal Fusion Network
Experiment
Datasets and Preprocessing
Settings
Results
...and 8 more sections

Figures (8)

Figure 1: The overall structure of the proposed DMDNet. Our DMDNet consists of an encoder $E_{\phi}$ and a decoding function $f_{\theta}$. $E_{\phi}$ represents the noisy depth image as a set of latent codes distributed in the 2D spatial domain, while $f_{\theta}$ takes the coordinate information and the latent code queried by the coordinate as input and predicts the denoised depth value at the given coordinate.
Figure 2: Illustration of DIIF. The input noisy depth map is represented as a feature map $F \in \mathbb{R}^{C \times H \times W}$ by the encoder $E_{\phi}$. Each feature vector $z_{i} \in \mathbb{R}^{C \times 1 \times 1}$ in the feature map (shown as a grid of yellow dashed lines in the middle plot) is assigned a coordinate. Given a query coordinate $x_{q}$, we first select the closest feature vector $z^{*}$ (depicted as the green dot in the middle plot, with coordinate $x^{*}$) based on the Euclidean distance from $x_{q}$ to the position of each feature vector, and use it as the latent code for this local region. Then, the decoding function $f_{\theta}$ takes the latent code $z^{*}$ and the distance vector ($x_{q} - x^{*}$) as inputs, and outputs the predicted denoised depth value at $x_{q}$.
Figure 3: The detailed structure of the DMDNet encoder. The $3 \times 3$ convolutional layer here has a default stride of 1 and a padding of 1. The $\times N$ in the cuboid indicates the number of blocks.
Figure 4: A simple illustration of MSDF. The feature maps obtained from the encoder at different resolutions are fed into the decoding function in turn, along with the query coordinates. Their intermediate features are fused by element-wise addition before the last linear layer. Finally, the fused features are fed into the final linear layer with Tanh activation to predict the denoised depth values.
Figure 5: (a) Architecture of our LDNFNet. The structure of the Backbone is the same as Led3D (Mu et al., 2019). (b) Detailed structure of the Fusion Block. We take the multi-branch convolutional structure proposed by Xie et al. (2017) to build the fusion block. (c) Detailed structure of the ConvBlock.
...and 3 more figures
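
The query step described in the Figure 2 caption can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the coordinate convention (cell centers spanning $[-1, 1]$), the helper names `cell_centers`, `query_latent`, and `toy_decoder`, and the single-layer decoder are all assumptions made for illustration. In particular, `toy_decoder` is only a hypothetical stand-in for $f_{\theta}$ that borrows the final Tanh activation mentioned in the Figure 4 caption.

```python
import math

def cell_centers(n):
    """Centers of n equal cells spanning [-1, 1] (an assumed coordinate convention)."""
    return [-1.0 + (2 * i + 1) / n for i in range(n)]

def query_latent(codes, x_q):
    """Return (z_star, x_star, offset) for a query coordinate x_q = (y, x).

    codes[i][j] is the latent vector (a list of C floats) at grid row i, column j.
    The nearest latent code is chosen by Euclidean distance, as in the caption.
    """
    ys = cell_centers(len(codes))
    xs = cell_centers(len(codes[0]))
    best = None
    for i, cy in enumerate(ys):
        for j, cx in enumerate(xs):
            d2 = (x_q[0] - cy) ** 2 + (x_q[1] - cx) ** 2  # squared distance
            if best is None or d2 < best[0]:
                best = (d2, i, j)
    _, i, j = best
    x_star = (ys[i], xs[j])
    offset = (x_q[0] - x_star[0], x_q[1] - x_star[1])  # x_q - x*
    return codes[i][j], x_star, offset

def toy_decoder(z_star, offset, weights, bias=0.0):
    """Hypothetical stand-in for f_theta: one linear layer followed by tanh."""
    inputs = list(z_star) + list(offset)
    return math.tanh(sum(w * v for w, v in zip(weights, inputs)) + bias)

# A 2x2 grid of 1-dimensional latent codes (made-up values).
codes = [[[0.1], [0.2]], [[0.3], [0.4]]]
z_star, x_star, offset = query_latent(codes, (0.4, -0.6))
depth = toy_decoder(z_star, offset, weights=[1.0, 0.5, 0.5])
```

Here the query point falls in the lower-left cell, so the latent code at row 1, column 0 is selected and the decoder receives that code together with the offset from the cell center. A real decoding function would be a learned multi-layer network rather than a fixed linear map.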
