TexVocab: Texture Vocabulary-conditioned Human Avatars

Yuxiao Liu, Zhe Li, Yebin Liu, Haoqian Wang

TL;DR

TexVocab addresses the challenge of generating high-fidelity animatable human avatars from multi-view RGB videos by introducing a texture vocabulary tied to pose-conditioned texture maps. It constructs texture maps by back-projecting images onto the posed SMPL surface and mapping them to a fixed SMPL UV domain, then learns a body-part–wise embedding to capture pose-dependent texture changes. Pose features are queried via KNN over key body parts, interpolated with skinning-aware attention, and used to condition a NeRF decoder for dynamic appearances. Experiments across THUman4.0, ZJU-MoCap, and DeepCap demonstrate state-of-the-art quality and robust pose generalization, with ablations confirming the benefits of body-part–wise encoding and multi-view texture maps; limitations include reliance on dense views and SMPL-based clothing representations.

Abstract

To adequately utilize the available image evidence in multi-view video-based avatar modeling, we propose TexVocab, a novel avatar representation that constructs a texture vocabulary and associates body poses with texture maps for animation. Given multi-view RGB videos, our method initially back-projects all the available images in the training videos to the posed SMPL surface, producing texture maps in the SMPL UV domain. Then we construct pairs of human poses and texture maps to establish a texture vocabulary for encoding dynamic human appearances under various poses. Unlike the commonly used joint-wise manner, we further design a body-part-wise encoding strategy to learn the structural effects of the kinematic chain. Given a driving pose, we query the pose feature hierarchically by decomposing the pose vector into several body parts and interpolating the texture features for synthesizing fine-grained human dynamics. Overall, our method is able to create animatable human avatars with detailed and dynamic appearances from RGB videos, and the experiments show that our method outperforms state-of-the-art approaches. The project page can be found at https://texvocab.github.io/.
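
The hierarchical query described above can be illustrated with a minimal sketch. It is not the authors' implementation. The body-part grouping `BODY_PARTS`, the Euclidean distance on axis-angle vectors, the inverse-distance weights, and all function names are assumptions made for illustration only. The sketch shows just the idea of decomposing a pose into body parts and finding, per part, the nearest key poses and their blending weights.

```python
import math

# Hypothetical grouping of joint indices into body parts
# (an assumption for illustration, not the partition used in the paper).
BODY_PARTS = {
    "torso": [0, 1],
    "left_arm": [2, 3],
    "right_arm": [4, 5],
}


def part_vector(pose, joints):
    """Concatenate the per-joint rotation triples of one body part."""
    return [x for j in joints for x in pose[j]]


def knn_weights(query, keys, k=2, eps=1e-8):
    """Return {key_index: weight} for the k nearest keys.

    Weights are normalized inverse distances (an assumed blending rule).
    """
    nearest = sorted((math.dist(query, key), i) for i, key in enumerate(keys))[:k]
    inv = [(i, 1.0 / (d + eps)) for d, i in nearest]
    total = sum(w for _, w in inv)
    return {i: w / total for i, w in inv}


def query_vocabulary(pose, key_poses, k=2):
    """For each body part, find the nearest key poses and their weights."""
    result = {}
    for name, joints in BODY_PARTS.items():
        q = part_vector(pose, joints)
        keys = [part_vector(kp, joints) for kp in key_poses]
        result[name] = knn_weights(q, keys, k)
    return result
```

In a full pipeline, the returned weights would blend features sampled from the texture maps paired with each key pose, so that different body parts can draw on different training frames.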

Paper Structure (14 sections, 10 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Overview of TexVocab. Given multi-view RGB videos of one character, we construct a texture vocabulary, and create realistic animatable human avatars.
  • Figure 2: Framework of TexVocab. We first construct TexVocab by decomposing SMPL poses into body parts, sampling key body parts, and gathering the corresponding texture maps. Then, given a query pose and a 3D coordinate, we decompose the pose into body parts, interpolate the key body parts, and sample the texture maps to obtain the pose-conditioned feature. Finally, a NeRF represented as an MLP decodes the dynamic character and renders the human appearance with detailed pose-dependent dynamics.
  • Figure 3: Overview of texture map preparation. First, we back-project all the available pixels to the posed SMPL mesh. Then we convert the projected points on the SMPL mesh to a particular UV domain. Finally, we gather and average all the available pixels, and obtain texture maps based on multi-view images.
  • Figure 4: We decompose the SMPL skeleton into several body parts. Joints with the same color belong to the same body part.
  • Figure 5: Qualitative comparisons against TAVA, ARAH, AniNeRF and PoseVocab. We evaluate the methods on the THUman4.0 and DeepCap datasets and show animation results on both training poses and novel poses.
  • ...and 3 more figures
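
The last step of the texture map preparation in Figure 3 (gathering and averaging pixels in the UV domain) can be sketched as follows. This is a simplified illustration, not the authors' code. It assumes the back-projection and UV conversion have already produced samples of the form `(u, v, rgb)` with `u` and `v` in [0, 1]. The function name `average_texture`, the nearest-texel binning, and the row/column convention are all hypothetical choices.

```python
def average_texture(samples, size):
    """Average color samples that fall into the same texel.

    samples: iterable of (u, v, (r, g, b)) with u, v in [0, 1],
             assumed to come from all views after back-projection.
    size:    texture resolution (size x size texels).
    Returns (texture, mask); mask[row][col] is True where at least
    one pixel was observed, and unobserved texels are left black.
    """
    sums = [[[0.0, 0.0, 0.0] for _ in range(size)] for _ in range(size)]
    counts = [[0] * size for _ in range(size)]

    for u, v, rgb in samples:
        # Nearest-texel binning; clamp so u == 1.0 or v == 1.0 stays in range.
        col = min(int(u * size), size - 1)
        row = min(int(v * size), size - 1)
        for c in range(3):
            sums[row][col][c] += rgb[c]
        counts[row][col] += 1

    texture, mask = [], []
    for row in range(size):
        tex_row, mask_row = [], []
        for col in range(size):
            n = counts[row][col]
            mask_row.append(n > 0)
            tex_row.append([s / n for s in sums[row][col]] if n else [0.0, 0.0, 0.0])
        texture.append(tex_row)
        mask.append(mask_row)
    return texture, mask
```

The mask matters because many texels are never seen by any camera in a given frame, and a downstream model would need to distinguish a black observation from a missing one.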