CloseUpAvatar: High-Fidelity Animatable Full-Body Avatars with Mixture of Multi-Scale Textures

David Svitov, Pietro Morerio, Lourdes Agapito, Alessio Del Bue

TL;DR

CloseUpAvatar introduces a full-body avatar representation built from textured billboards that carry two levels of texture detail (MST^L and MST^H). A Mixture of Multi-Scale Textures blends these levels with a camera-distance-dependent coefficient, enabling high-fidelity close-ups while preserving efficiency for distant views. The approach combines billboard splatting, SMPL-X alignment, and a training regime with camera augmentation and regularization so that it converges across diverse viewpoints, achieving state-of-the-art perceptual metrics at both close and far views with real-time performance. Experiments on ActorsHQ demonstrate strong qualitative and quantitative gains, though limitations remain in capturing non-rigid finger and face details and in handling extreme deformations of large primitives.

Abstract

We present CloseUpAvatar, a novel approach to articulated human avatar representation that handles more general camera motions while preserving rendering quality for close-up views. CloseUpAvatar represents an avatar as a set of textured planes with two sets of learnable textures, one for low-frequency and one for high-frequency detail. The method automatically switches to the high-frequency textures only for cameras positioned close to the avatar's surface and gradually reduces their impact as the camera moves farther away. This parametrization enables CloseUpAvatar to adjust rendering quality based on camera distance, ensuring realistic rendering across a wider range of camera orientations than previous approaches. We provide experiments on the ActorsHQ dataset with high-resolution input images. CloseUpAvatar demonstrates both qualitative and quantitative improvements over existing methods when rendering from a wide range of novel camera positions, while maintaining high FPS by limiting the number of required primitives.
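The distance-dependent switching described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the linear falloff, the `near` and `far` thresholds, the convex-combination form of the blend, and both function names are assumptions made for the example.

```python
from typing import Sequence, Tuple


def blend_weight(distance: float, near: float = 0.5, far: float = 2.0) -> float:
    """Hypothetical coefficient omega in [0, 1].

    Returns 1.0 at or below `near`, 0.0 at or beyond `far`, and falls off
    linearly in between. The thresholds and the linear shape are assumed.
    """
    if far <= near:
        raise ValueError("far must be greater than near")
    t = (far - distance) / (far - near)
    return min(1.0, max(0.0, t))


def blend_texels(
    low: Sequence[float], high: Sequence[float], omega: float
) -> Tuple[float, ...]:
    """Weighted sum of a low-frequency and a high-frequency RGBA texel.

    Assumed form: (1 - omega) * low + omega * high, applied per channel.
    """
    if len(low) != len(high):
        raise ValueError("texels must have the same number of channels")
    return tuple((1.0 - omega) * l + omega * h for l, h in zip(low, high))


# Example texels sampled at the same [u, v] location (values are made up).
low_texel = (0.2, 0.2, 0.2, 1.0)
high_texel = (0.8, 0.8, 0.8, 1.0)

close_up = blend_texels(low_texel, high_texel, blend_weight(0.25))
far_away = blend_texels(low_texel, high_texel, blend_weight(3.0))
```

With these assumed thresholds, a camera at distance 0.25 uses only the high-frequency texel, and a camera at distance 3.0 falls back entirely to the low-frequency one.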

Paper Structure

This paper contains 15 sections, 10 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Method description. We learn the surfels' positions and orientations in the canonical space and transform them to the target pose using Linear Blend Skinning (LBS). Each surfel $i$ has two learnable Multi-Scale Textures (MST) that store four-channel RGB+$\alpha$ information for low-frequency ($\textrm{MST}^L$) and high-frequency ($\textrm{MST}^H$) details. We store them as two tensors of corresponding textures and sample the values for surfel $i$ at $[u, v]$ coordinates. The final color of the surfel is computed as a weighted sum with a view-dependent coefficient $\omega$, which controls the amount of high-frequency detail based on the camera distance.
  • Figure 2: Camera augmentation. We augment camera positions by cropping and padding dataset images to produce closer and farther views. We also modify the camera matrices as described in \ref{sec:method_cameras} to ensure that the rendering matches the modified dataset images.
  • Figure 3: Camera augmentation effect. a) The camera augmentation described in \ref{sec:method_cameras} negatively affects the convergence of the Gaussian-based Mmlphuman [zhan2025real] method, resulting in rendering artifacts and a significant drop in objective metrics. b) For our approach, camera augmentation leads to sharper results but can cause a slight drop in pixel-level metrics, because such metrics are sensitive to alignment with the ground truth.
  • Figure 4: Qualitative comparison for varying cameras. Comparison of rendering quality in novel poses for novel-view synthesis across different camera distances.
  • Figure 5: Novel poses. Showcase of our avatars in novel poses from the AMASS [mahmood2019amass] dataset.
  • ...and 4 more figures
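
The Figure 2 caption says the camera matrices are modified so that renders match the cropped or padded images. The paper's exact procedure is in its method section and is not reproduced here; the sketch below only shows the standard pinhole-camera bookkeeping for such image edits. The `Intrinsics` class and the `crop`, `pad`, and `resize` helpers are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole intrinsics in pixels (hypothetical container for this sketch)."""
    fx: float
    fy: float
    cx: float
    cy: float
    width: int
    height: int


def crop(k: Intrinsics, x0: int, y0: int, w: int, h: int) -> Intrinsics:
    """Cropping moves the image origin, so the principal point shifts by the offset."""
    return Intrinsics(k.fx, k.fy, k.cx - x0, k.cy - y0, w, h)


def pad(k: Intrinsics, left: int, top: int, right: int, bottom: int) -> Intrinsics:
    """Padding adds border pixels, so the principal point shifts the other way."""
    return Intrinsics(
        k.fx, k.fy, k.cx + left, k.cy + top,
        k.width + left + right, k.height + top + bottom,
    )


def resize(k: Intrinsics, new_w: int, new_h: int) -> Intrinsics:
    """Resampling scales focal lengths and principal point.

    This ignores the half-pixel offset that some pixel-center conventions need.
    """
    sx, sy = new_w / k.width, new_h / k.height
    return Intrinsics(k.fx * sx, k.fy * sy, k.cx * sx, k.cy * sy, new_w, new_h)


# Made-up camera: a centre crop resized back to full resolution acts like a
# camera with twice the focal length, i.e. a closer-looking view.
base = Intrinsics(1000.0, 1000.0, 512.0, 512.0, 1024, 1024)
closer = resize(crop(base, 256, 256, 512, 512), 1024, 1024)
farther = pad(base, 100, 100, 100, 100)
```

In this example `closer` keeps the principal point at the image centre while doubling the focal length, and `farther` keeps the focal length while enlarging the canvas, which is why the rendered image can still line up with the edited ground-truth image.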