Adaptive Multi-Modal Control of Digital Human Hand Synthesis Using a Region-Aware Cycle Loss

Qifan Fu; Xiaohang Yang; Muhammad Asad; Changjae Oh; Shanxin Yuan; Gregory Slabaugh

TL;DR

We propose a novel Region-Aware Cycle Loss (RACL) that focuses diffusion model training on the hand region, improving the quality of generated hand gestures.

Abstract

Diffusion models have shown a remarkable ability to synthesize images, including the generation of humans in specific poses. However, current models struggle to adequately express conditional control for detailed hand pose generation, leading to significant distortion in the hand regions. To tackle this problem, we first curate the How2Sign dataset to provide richer and more accurate hand pose annotations. In addition, we introduce adaptive, multi-modal fusion to integrate characters' physical features expressed in different modalities such as skeleton, depth, and surface normal. Furthermore, we propose a novel Region-Aware Cycle Loss (RACL) that enables the diffusion model training to focus on improving the hand region, resulting in improved quality of generated hand gestures. More specifically, the proposed RACL computes a weighted keypoint distance between the full-body pose keypoints from the generated image and the ground truth, to generate higher-quality hand poses while balancing overall pose accuracy. Moreover, we use two hand region metrics, named hand-PSNR and hand-Distance, to evaluate hand pose generation. Our experimental evaluations demonstrate the effectiveness of our proposed approach in improving the quality of digital human pose generation using diffusion models, especially in the hand region. The source code is available at https://github.com/fuqifan/Region-Aware-Cycle-Loss.
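As a rough illustration of the weighted keypoint distance described in the abstract, the following minimal sketch sums per-keypoint Euclidean distances and gives hand keypoints a larger weight than body keypoints. The function name `region_aware_cycle_loss`, the `hand_indices` argument, and the weight values are hypothetical choices made for this example. The paper's actual weights, keypoint layout, and normalization are not stated here, so this should not be read as the authors' implementation.

```python
import math


def region_aware_cycle_loss(pred_kpts, gt_kpts, hand_indices,
                            hand_weight=2.0, body_weight=1.0):
    """Weighted sum of per-keypoint Euclidean distances.

    pred_kpts / gt_kpts: lists of (x, y) tuples in the same keypoint order.
    hand_indices: indices of keypoints that belong to the hand region.
    hand_weight / body_weight: assumed weights, chosen only for illustration.
    """
    if len(pred_kpts) != len(gt_kpts):
        raise ValueError("keypoint lists must have the same length")
    hand_set = set(hand_indices)
    total = 0.0
    for i, ((px, py), (gx, gy)) in enumerate(zip(pred_kpts, gt_kpts)):
        dist = math.hypot(px - gx, py - gy)  # Euclidean distance
        weight = hand_weight if i in hand_set else body_weight
        total += weight * dist
    return total


# Toy example: keypoint 0 is a body joint, keypoints 1 and 2 belong to a hand.
gt = [(0.0, 0.0), (10.0, 10.0), (12.0, 10.0)]
pred = [(3.0, 4.0), (10.0, 13.0), (12.0, 10.0)]
loss = region_aware_cycle_loss(pred, gt, hand_indices=[1, 2])
```

In this toy case the body joint is off by 5 pixels and one hand joint by 3 pixels, so the smaller hand error contributes more to the loss (6.0) than the larger body error (5.0). That is the sense in which such a loss pushes training toward the hand region.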

Paper Structure (18 sections, 4 equations, 5 figures, 4 tables)

Introduction
Related Work
Pose-Controlled Digital Human Generation
Multi-modal Control Fusion for Generative AI
Photo-Realistic Sign Language Production
Method
Preliminaries
Dataset Curation
Data Cleansing
Data Relabelling
Adaptive Multi-modal Control Fusion Network for ControlNet
Region-Aware Cycle Loss (RACL) for Model Training
Experiments
Implementation Details
Quantitative Results
...and 3 more sections

Figures (5)

Figure 1: Annotation of human poses with different methods. The recently introduced DWPose method produces higher-quality keypoint and skeleton annotations than OpenPose. The skeleton, however, lacks finer details such as the person's body shape. Depth annotation provides additional body shape information, but when the hand moves near the body, it is difficult to distinguish the hands in the depth map. In contrast, the surface normal provides a more detailed label that captures the body shape, the skin texture, and the surface texture of clothes, but it is sensitive to motion blur. Notably, the outputs of different annotation modalities may conflict (marked with yellow boxes), which occurs frequently between the skeleton and the depth map.
Figure 2: Our data pre-processing pipeline consists of two steps: data cleansing and data relabelling. In the data cleansing stage, frames with motion-blurred hands are filtered out based on the predictions of the OpenPose annotator, leaving clear frames for the second stage. In the data relabelling stage, the DWPose and Omnidata annotators are applied to the clear frames to obtain the corresponding modality annotations.
Figure 3: The model training pipeline with the adaptive multi-control fusion network and Region-Aware Cycle Loss (RACL). The weight prediction module predicts adaptive weights from the input control features, and the fused features are obtained as a weighted sum of these features. We use DWPose to extract keypoint coordinates from the generated image and the ground truth frame. The Euclidean distances between corresponding keypoints are then computed, weighted, and summed to form the RACL. Combining the RACL with the MSE loss enhances the learning of hand features during model training. Note that no ground truth data are input at inference.
Figure 4: Qualitative comparison of the proposed method with other conditional control methods. Qualitative performance deteriorates gradually from the top row to the bottom row. The proposed adaptive multi-modal control fusion with RACL achieves better hand gesture control and image quality than the other methods. Please zoom in for details.
Figure 5: Qualitative ablation study of the surface normal input and the proposed adaptive multi-modal fusion method with (w) or without (wo) RACL. RACL affects the skeletal information of the hand and face. The multi-modal fusion module also affects background generation.
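The Figure 3 caption describes fusion as a weighted summation of control features using predicted adaptive weights. The sketch below illustrates only that summation step. It assumes each modality's feature is a flat list of floats and that the weights come from a softmax over per-modality scores. The softmax normalization, the `fuse_controls` name, and the hand-supplied scores are assumptions for this example. In the paper the weights are produced by a learned weight prediction module whose architecture is not described here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def fuse_controls(features, scores):
    """Fuse per-modality feature vectors by weighted summation.

    features: dict mapping modality name -> list of floats (equal lengths).
    scores:   dict mapping modality name -> raw score; a stand-in for the
              output of a learned weight prediction module (assumption).
    Returns (fused_vector, weights_by_modality).
    """
    names = list(features)
    weights = softmax([scores[n] for n in names])
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for name, w in zip(names, weights):
        if len(features[name]) != dim:
            raise ValueError("all feature vectors must have the same length")
        for j, value in enumerate(features[name]):
            fused[j] += w * value
    return fused, dict(zip(names, weights))


# Toy example with the three modalities named in the abstract.
features = {
    "skeleton": [3.0, 0.0],
    "depth": [0.0, 3.0],
    "normal": [3.0, 3.0],
}
equal_scores = {"skeleton": 0.0, "depth": 0.0, "normal": 0.0}
fused, weights = fuse_controls(features, equal_scores)
```

With equal scores each modality receives a weight of one third, so the fused vector is the plain average of the three inputs. Raising one modality's score shifts the fused features toward that modality, which is how adaptive weighting could down-weight a modality that conflicts with the others.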
