Human Pose-Constrained UV Map Estimation

Matej Suchanek; Miroslav Purkrabek; Jiri Matas

TL;DR

PC-CSE tackles UV map estimation by enforcing global anatomical plausibility via pose-conditioned proximal regions. It extends the Continuous Surface Embeddings (CSE) framework by incorporating 2D pose without retraining, yielding more coherent UV maps and reducing artifacts. Evaluations on DensePose COCO show consistent gains across multiple pose estimators, with further improvements using full-body skeletons for hands and feet; however, gains are limited by segmentation and ground-truth annotation issues. This approach highlights the potential and limits of using 2D pose as a global constraint for detailed texture mapping.

Abstract

UV map estimation is used in computer vision for detailed analysis of human posture or activity. Previous methods assign pixels to body model vertices by comparing pixel descriptors independently, without enforcing global coherence or plausibility in the UV map. We propose Pose-Constrained Continuous Surface Embeddings (PC-CSE), which integrates estimated 2D human pose into the pixel-to-vertex assignment process. The pose provides global anatomical constraints, ensuring that UV maps remain coherent while preserving local precision. Evaluation on DensePose COCO demonstrates consistent improvement, regardless of the chosen 2D human pose model. Whole-body poses offer better constraints by incorporating additional details about the hands and feet. Conditioning UV maps with human pose reduces invalid mappings and enhances anatomical plausibility. In addition, we highlight inconsistencies in the ground-truth annotations.
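The difference between independent and pose-constrained pixel-to-vertex assignment can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `assign_vertices`, the use of squared Euclidean distance between descriptors, the boolean `allowed` mask, and the fallback for pixels with no allowed vertex are all assumptions made for illustration.

```python
import numpy as np


def assign_vertices(pixel_emb, vertex_emb, allowed=None):
    """Assign each pixel to its nearest vertex in descriptor space.

    pixel_emb:  (P, D) array of pixel descriptors.
    vertex_emb: (V, D) array of vertex descriptors.
    allowed:    optional (P, V) boolean mask (hypothetical); False entries
                mark vertices a pixel may not be assigned to.
    Returns a (P,) array of vertex indices.
    """
    # Squared Euclidean distance between every pixel and every vertex
    # (assumed metric; the source only says descriptors are compared).
    diff = pixel_emb[:, None, :] - vertex_emb[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)

    if allowed is not None:
        # Assumed fallback: a pixel with no allowed vertex stays unconstrained.
        has_option = allowed.any(axis=1, keepdims=True)
        dist = np.where(allowed | ~has_option, dist, np.inf)

    # Each pixel independently picks its closest remaining vertex.
    return np.argmin(dist, axis=1)
```

With `allowed=None` the sketch reduces to the independent per-pixel assignment described for previous methods. Passing a mask derived from the estimated pose restricts each pixel to anatomically plausible vertices while keeping the local nearest-descriptor choice.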

Paper Structure (12 sections, 6 equations, 7 figures, 1 table)

Introduction
Related Work
Method
Conditioning CSE by pose
Determining proximal regions
Data
Assessing the quality of annotations
Experiments
Evaluation metrics
Results
Ablation study
Conclusions

Figures (7)

Figure 1: Continuous Surface Embeddings (CSE) [DensePoseCSE] (left) vs. Pose-Constrained CSE (PC-CSE, right). CSE assigns each pixel of the body segmentation to a vertex, and thus a UV coordinate, on a canonical body-shape mesh. Because CSE assigns each pixel independently, artifacts such as limb duplication appear (yellow circles). PC-CSE uses pose constraints during UV map estimation, producing smoother maps and eliminating such artifacts. The UV value at each pixel is visualized by color coding. The location of a given color on the canonical surface is shown in the inset image at the top left.
Figure 2: Pose-Constrained CSE (PC-CSE) takes an estimated bounding box, segmentation mask, and 2D human pose (a) as input. It computes proximal regions (b) for each body part and assigns pixels to SMPL [SMPL] vertices to generate a UV map. Unlike CSE [DensePoseCSE], PC-CSE constrains pixel assignments using proximal regions, ensuring that the resulting UV map aligns with the estimated pose (c).
Figure 3: CSE [DensePoseCSE] (left) vs. PC-CSE conditioned on the estimated pose (right). Pose constraints ensure smoother UV maps and prevent limb duplication within a single image. A frontal view of the SMPL model [SMPL] is shown to help assess the UV estimation.
Figure 4: Ablation on the bone width $\Delta$ defined in the Method section. RTMPose-l wb [RTMPose] is used for pose constraints. Bones that are too thin over-restrict the UV map and hurt performance on border pixels. Bones that are too thick do not restrict the UV map enough and reduce the performance gain. Note that performance with large $\Delta$ converges to that of the baseline method: in the extreme case where every bone is as large as the whole image, no constraints are applied. The best value is 0.08.
Figure 5: Ablation on height estimation. We infer pose from a dance video from the TikTok dataset [TikTokDataset] at 10 frames per second and estimate the dancing person's height in pixels (red) using the algorithm in the Method section. The estimate is somewhat noisy due to pose changes but stays within an interval of a few tens of pixels at all times. The larger noise at the end of the video is caused by more extreme poses.
...and 2 more figures
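The role of the bone width $\Delta$ in the Figure 4 caption can be made concrete with a small sketch that rasterizes a proximal region around a single bone. This is an illustrative assumption, not the paper's algorithm. The function `bone_region`, the capsule shape, treating $\Delta$ as a radius, and scaling it by the estimated person height in pixels are all hypothetical choices.

```python
import numpy as np


def bone_region(shape, joint_a, joint_b, height_px, delta=0.08):
    """Boolean mask of pixels close to the bone segment joint_a-joint_b.

    shape:     (rows, cols) of the image.
    joint_a/b: 2D joint positions as (x, y) in pixels.
    height_px: estimated person height in pixels (assumed scale factor).
    delta:     bone width as a fraction of height_px (assumed to be a radius).
    """
    rows, cols = shape
    ys, xs = np.mgrid[0:rows, 0:cols]
    p = np.stack([xs, ys], axis=-1).astype(float)  # (rows, cols, 2) as (x, y)

    a = np.asarray(joint_a, dtype=float)
    b = np.asarray(joint_b, dtype=float)
    ab = b - a
    denom = float(ab @ ab)

    if denom == 0.0:
        # Degenerate bone: both joints coincide, so the region is a disc.
        t = np.zeros((rows, cols))
    else:
        # Projection of each pixel onto the segment, clamped to its ends.
        t = np.clip(((p - a) @ ab) / denom, 0.0, 1.0)

    closest = a + t[..., None] * ab
    dist = np.linalg.norm(p - closest, axis=-1)
    return dist <= delta * height_px
```

Under these assumptions, a small `delta` yields a thin capsule that excludes border pixels of the limb, while a very large `delta` makes the mask cover the whole image, which corresponds to applying no constraint at all.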