
AvatarPointillist: AutoRegressive 4D Gaussian Avatarization

Hongyu Liu, Xuan Wang, Yating Wang, Zijian Wu, Ziyu Wan, Yue Ma, Runtao Liu, Boyao Zhou, Yujun Shen, Qifeng Chen

Abstract

We introduce AvatarPointillist, a novel framework for generating dynamic 4D Gaussian avatars from a single portrait image. At the core of our method is a decoder-only Transformer that autoregressively generates a point cloud for 3D Gaussian Splatting. This sequential approach allows for precise, adaptive construction, dynamically adjusting point density and the total number of points based on the subject's complexity. During point generation, the AR model also jointly predicts per-point binding information, enabling realistic animation. After generation, a dedicated Gaussian decoder converts the points into complete, renderable Gaussian attributes. We demonstrate that conditioning the decoder on the latent features from the AR generator enables effective interaction between stages and markedly improves fidelity. Extensive experiments validate that AvatarPointillist produces high-quality, photorealistic, and controllable avatars. We believe this autoregressive formulation represents a new paradigm for avatar generation, and we will release our code to inspire future research.

Paper Structure

This paper contains 25 sections, 7 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Comparison of different Gaussian point cloud modeling approaches. LAM he2025lam constructs Gaussian point clouds based on a point cloud template, which fails to reconstruct fine details from the image, such as ponytails. In contrast, our method uses an AR model to directly model the Gaussian point cloud, learning to adaptively adjust point density and count for precise modeling. We also include final rendering results for comparison: LAM produces distorted geometry and shows noticeable artifacts.
  • Figure 2: Overview of our framework. It consists of two modules: an autoregressive (AR) model for Gaussian geometry generation and a Gaussian Decoder for predicting rendering attributes. The AR model takes image features from DINOv2 oquab2023dinov2 and point cloud features as input. The point cloud features are extracted via Pixel3DMM giebenhain2025pixel3dmm and a point cloud encoder zhao2023michelangelo. The AR model is trained to generate a Gaussian point cloud via next-token prediction, where each point is represented by four quantized tokens $(T_n^x, T_n^y, T_n^z, T_n^b)$ corresponding to coordinates and binding information. After generation, the tokens are de-quantized to obtain the actual coordinates. We then combine the positional embeddings $P_n$ with the internal features $F_n^p$ from the AR model as input to the Gaussian Decoder to predict the final accurate Gaussian attributes. Finally, the result is animated using Linear Blend Skinning (LBS) and the binding information.
  • Figure 3: Qualitative comparison with state-of-the-art methods. The leftmost column shows the input images, with the target image displayed in the bottom-right corner. The first row presents self-reenactment results, while the remaining three rows show cross-reenactment results. Our method demonstrates superior performance in expression and pose consistency, as well as better identity preservation compared to other approaches.
  • Figure 4: Visualization of ablation study on input setting of Gaussian decoder. The leftmost column shows the input. The FLAME Positions baseline, similar to the LAM method, uses the canonical FLAME mesh vertices as a template and only applies decoder-predicted offsets to deform this template into a final Gaussian point cloud. Pointwise AR Feature refers to using only the AR features ($F_n^p$) without positional information, while Positional Encoding uses only the point embeddings ($P_n$) without AR features.
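The Figure 2 caption describes representing each point as four discrete tokens $(T_n^x, T_n^y, T_n^z, T_n^b)$ that are quantized before next-token prediction and de-quantized back to coordinates afterward. The paper excerpt does not specify the quantization scheme, so the sketch below is an illustrative assumption: uniform binning of coordinates normalized to $[-1, 1]$, with a hypothetical bin count of 1024 and an integer binding id carried through unchanged.

```python
import numpy as np

def quantize_points(points, bindings, num_bins=1024):
    """Map continuous xyz coordinates in [-1, 1] to discrete token ids
    and append the per-point binding token (illustrative scheme only)."""
    coords = np.clip((points + 1.0) / 2.0, 0.0, 1.0)            # -> [0, 1]
    tokens_xyz = np.minimum((coords * num_bins).astype(int), num_bins - 1)
    return np.concatenate([tokens_xyz, bindings[:, None]], axis=1)  # (N, 4)

def dequantize_points(tokens, num_bins=1024):
    """Recover approximate coordinates (bin centers) and binding ids."""
    centers = (tokens[:, :3].astype(float) + 0.5) / num_bins     # bin centers
    return centers * 2.0 - 1.0, tokens[:, 3]
```

With this uniform scheme, de-quantization introduces at most half a bin width of error per axis (1/1024 in the $[-1, 1]$ range), which bounds how much geometric detail the token resolution can preserve.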