Dream, Lift, Animate: From Single Images to Animatable Gaussian Avatars

Marcel C. Bühler, Ye Yuan, Xueting Li, Yangyi Huang, Koki Nagano, Umar Iqbal

TL;DR

Dream, Lift, Animate (DLA) tackles animatable 3D avatar reconstruction from a single image by coupling diffusion-based multi-view hallucination with a two-stage Gaussian lifting and a UV-space latent mapping grounded in SMPL-X. A transformer encoder converts unstructured 3D Gaussians into a UV-aligned latent code $\mathbf{Z}$, which a Gaussian Parameter Decoder decodes into a UV Gaussian map $\mathbf{F}$ that supports pose- and view-conditioned deformation via SMPL-X linear blend skinning. The method renders in real time, supports intuitive editing, and reports state-of-the-art results on ActorsHQ and 4D-Dress in both perceptual quality and photometric accuracy, bridging unstructured 3D representations and animation-ready avatars. The authors acknowledge societal risks such as identity misuse and deepfakes, and point to future work on in-the-wild training and robust governance.

Abstract

We introduce Dream, Lift, Animate (DLA), a novel framework that reconstructs animatable 3D human avatars from a single image. This is achieved by leveraging multi-view generation, 3D Gaussian lifting, and pose-aware UV-space mapping of 3D Gaussians. Given an image, we first dream plausible multi-views using a video diffusion model, capturing rich geometric and appearance details. These views are then lifted into unstructured 3D Gaussians. To enable animation, we propose a transformer-based encoder that models global spatial relationships and projects these Gaussians into a structured latent representation aligned with the UV space of a parametric body model. This latent code is decoded into UV-space Gaussians that can be animated via body-driven deformation and rendered conditioned on pose and viewpoint. By anchoring Gaussians to the UV manifold, our method ensures consistency during animation while preserving fine visual details. DLA enables real-time rendering and intuitive editing without requiring post-processing. Our method outperforms state-of-the-art approaches on the ActorsHQ and 4D-Dress datasets in both perceptual quality and photometric accuracy. By combining the generative strengths of video diffusion models with a pose-aware UV-space Gaussian mapping, DLA bridges the gap between unstructured 3D representations and high-fidelity, animation-ready avatars.
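
The body-driven deformation mentioned above is described elsewhere on this page as SMPL-X linear blend skinning applied to canonical Gaussians plus pose- and view-dependent offsets. The following is a minimal NumPy sketch of that idea for Gaussian centers only, not the authors' implementation. The function names, the two-joint toy rig, and the hand-written skinning weights are all hypothetical; a real pipeline would take joint transforms and weights from SMPL-X and would also transform each Gaussian's rotation.

```python
import numpy as np


def rotation_z(angle_rad):
    """Return a 4x4 homogeneous rotation about the z-axis (toy joint transform)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T


def skin_gaussian_centers(canonical_xyz, offsets, skin_weights, joint_transforms):
    """Deform Gaussian centers with linear blend skinning (hypothetical helper).

    canonical_xyz:    (N, 3) centers in the canonical pose
    offsets:          (N, 3) pose- and view-dependent offsets added before skinning
    skin_weights:     (N, J) per-Gaussian joint weights, each row summing to 1
    joint_transforms: (J, 4, 4) rigid transform of every joint for the target pose
    """
    xyz = canonical_xyz + offsets
    # Homogeneous coordinates so rotation and translation apply in one product.
    homo = np.concatenate([xyz, np.ones((xyz.shape[0], 1))], axis=1)  # (N, 4)
    # Blend the joint transforms per Gaussian: sum_j w_nj * T_j.
    blended = np.einsum("nj,jab->nab", skin_weights, joint_transforms)  # (N, 4, 4)
    posed = np.einsum("nab,nb->na", blended, homo)  # (N, 4)
    return posed[:, :3]


# Toy rig: joint 0 stays fixed, joint 1 rotates 90 degrees about z.
canonical = np.array([[1.0, 0.0, 0.0]] * 3)
offsets = np.zeros_like(canonical)
weights = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
transforms = np.stack([np.eye(4), rotation_z(np.pi / 2)])

posed = skin_gaussian_centers(canonical, offsets, weights, transforms)
print(np.round(posed, 3))
```

In this toy setup the first point stays at (1, 0, 0), the second moves to (0, 1, 0), and the third, weighted equally between both joints, lands at (0.5, 0.5, 0). The last case illustrates why blended transforms can shrink geometry near joints, which is one reason learned offsets are useful on top of plain skinning.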

Paper Structure

This paper contains 29 sections, 3 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: We propose Dream, Lift, Animate, a novel framework to reconstruct high-fidelity, animatable 3D human avatars from a single image by generating multi-view images, lifting them to 3D Gaussians, and mapping them to a pose-aware UV space (Fig. \ref{fig:overview}). Our approach enables realistic animation and outperforms prior methods in visual quality (Fig. \ref{fig:comp_ahq_light} and Tbl. \ref{tbl:comp_ahq_light}). Watch videos and more at https://research.nvidia.com/labs/dair/dream-lift-animate.
  • Figure 2: Overview of the proposed Dream, Lift, Animate (DLA) framework for reconstructing animatable 3D human avatars from a single image. In the Dream stage (Sec. \ref{sec:dream}), we synthesize novel views from the input using a diffusion-based generator. In the Lift stage (Sec. \ref{sec:lift} and Fig. \ref{fig:encoder}), we project the multi-view images into a set of unstructured 3D Gaussians in the pose space using a learned Gaussian reconstruction model $\mathcal{G}$. Subsequently, we learn a transformer encoder $\mathcal{F}$ to map 3D Gaussians to a structured latent code $\mathbf{Z}$ in the UV space of a parametric body model. In the Animate stage (Sec. \ref{sec:animate} and Fig. \ref{fig:gpd}), we decode the avatar code into a pose- and view-aware Gaussian parameter map $\mathbf{F}$. This structured representation enables realistic animation and rendering via deformation with a body model.
  • Figure 3: Lift from multi-view images to an avatar latent code. The pose-space reconstruction model produces pixel-aligned Gaussian parameters with corresponding feature maps, denoted pose-space Gaussians and per-Gaussian features. The Gaussians are filtered and subsampled to construct a compact Gaussian feature $\mathbf{X}$ with 2048 Gaussians. These inputs are further processed by a transformer. Specifically, the compact Gaussian feature serves as context (key $\mathbf{K}$ and value $\mathbf{V}$) to a cross-attention layer whose queries $\mathbf{Q}$ are the positionally encoded vertex position map $\mathbf{P}$. Finally, the output is reshaped to yield the avatar latent code $\mathbf{Z}$. This figure omits linear projections, skip-connections, and positional encoding for improved readability.
  • Figure 4: Animate. The Gaussian Parameter Decoder (GPD, Sec. \ref{sec:gpd}) maps a UV-aligned latent $\mathbf{Z}$ to an animatable 3D Gaussian representation. The GPD upsamples the avatar latent code $\mathbf{Z}$ and produces two output maps: a canonical Gaussian map $\mathbf{F}_c$ and an offset map $\mathbf{F}_\Delta$. The offset map $\mathbf{F}_\Delta$ adds pose- and view-dependent offsets to the canonical Gaussian map $\mathbf{F}_c$, enabling pose- and view-dependent effects. Given a pose $\Theta$ and camera $\pi$, the Gaussians are deformed with linear blend skinning \cite{pavlakos2019smplx} and rasterized to an RGBA image \cite{kerbl20233d}.
  • Figure 5: Comparison for novel view synthesis with DreamGaussian \cite{dreamgaussian}, SiTH \cite{ho2024sith}, SIFU \cite{zhang2023sifu}, and IDOL \cite{li2023instant3d}. Tbl. \ref{tbl:comp_ahq_light} lists metrics.
  • ...and 11 more figures
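
The cross-attention step in the Figure 3 caption can be sketched as follows, as a single-head NumPy illustration and not the authors' architecture. Only the 2048-Gaussian context size and the roles of $\mathbf{P}$, $\mathbf{X}$, $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$, and $\mathbf{Z}$ come from the caption. The 8x8 UV resolution, the feature widths, the sinusoidal encoding, the random projection matrices, and all function names are assumptions made for the example; skip-connections and multi-head attention are left out.

```python
import numpy as np


def positional_encoding(points, num_freqs=4):
    """Sinusoidal encoding of (M, 3) points into (M, 3 * 2 * num_freqs) features."""
    freqs = 2.0 ** np.arange(num_freqs)                # (F,)
    angles = points[:, :, None] * freqs                # (M, 3, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (M, 3, 2F)
    return enc.reshape(points.shape[0], -1)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(P, X, Wq, Wk, Wv):
    """Single-head cross-attention: UV queries attend to Gaussian features.

    P: (M, 3) vertex positions, one per UV texel (source of the queries)
    X: (N, C) compact per-Gaussian features (source of keys and values)
    """
    Q = positional_encoding(P) @ Wq                    # (M, d)
    K = X @ Wk                                         # (N, d)
    V = X @ Wv                                         # (N, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))     # (M, N), rows sum to 1
    return attn @ V, attn


rng = np.random.default_rng(0)
uv_res, num_gaussians, feat_dim, d = 8, 2048, 32, 16   # all but 2048 are assumed

P = rng.uniform(-1.0, 1.0, size=(uv_res * uv_res, 3))  # stand-in position map
X = rng.normal(size=(num_gaussians, feat_dim))         # stand-in Gaussian features
Wq = rng.normal(size=(24, d)) * 0.1                    # 24 = 3 * 2 * num_freqs
Wk = rng.normal(size=(feat_dim, d)) * 0.1
Wv = rng.normal(size=(feat_dim, d)) * 0.1

out, attn = cross_attention(P, X, Wq, Wk, Wv)
Z = out.reshape(uv_res, uv_res, d)                     # UV-aligned latent code
print(Z.shape, attn.shape)
```

Because the queries come from the UV position map, the output has one latent vector per texel regardless of how the unstructured Gaussians are ordered, which is what makes the reshape into a UV grid meaningful.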