Dream, Lift, Animate: From Single Images to Animatable Gaussian Avatars
Marcel C. Bühler, Ye Yuan, Xueting Li, Yangyi Huang, Koki Nagano, Umar Iqbal
TL;DR
Dream, Lift, Animate (DLA) tackles animatable 3D avatar reconstruction from a single image by coupling diffusion-based multi-view hallucination with a two-stage Gaussian lifting and a UV-space latent mapping grounded in SMPL-X. A transformer encoder converts unstructured 3D Gaussians into a UV-aligned latent code $\mathbf{Z}$, which a Gaussian Parameter Decoder decodes into a UV Gaussian map $\mathbf{F}$ that supports pose- and view-conditioned deformation via SMPL-X linear blend skinning. The method renders in real time, supports intuitive editing, and achieves state-of-the-art results on ActorsHQ and 4D-Dress in both perceptual quality and photometric accuracy, bridging unstructured 3D representations and animation-ready avatars. The work also acknowledges potential societal risks such as identity misuse and deepfakes, and points to future work on in-the-wild training and robust governance to maximize beneficial impact.
Abstract
We introduce Dream, Lift, Animate (DLA), a novel framework that reconstructs animatable 3D human avatars from a single image. This is achieved by leveraging multi-view generation, 3D Gaussian lifting, and pose-aware UV-space mapping of 3D Gaussians. Given an image, we first dream plausible multi-views using a video diffusion model, capturing rich geometric and appearance details. These views are then lifted into unstructured 3D Gaussians. To enable animation, we propose a transformer-based encoder that models global spatial relationships and projects these Gaussians into a structured latent representation aligned with the UV space of a parametric body model. This latent code is decoded into UV-space Gaussians that can be animated via body-driven deformation and rendered conditioned on pose and viewpoint. By anchoring Gaussians to the UV manifold, our method ensures consistency during animation while preserving fine visual details. DLA enables real-time rendering and intuitive editing without requiring post-processing. Our method outperforms state-of-the-art approaches on the ActorsHQ and 4D-Dress datasets in both perceptual quality and photometric accuracy. By combining the generative strengths of video diffusion models with a pose-aware UV-space Gaussian mapping, DLA bridges the gap between unstructured 3D representations and high-fidelity, animation-ready avatars.
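To make the body-driven deformation step more concrete, the following is a minimal NumPy sketch of linear blend skinning applied to Gaussian centers. It is an illustration under stated assumptions, not the authors' implementation: the function name `skin_gaussian_means`, the array layouts, and the toy two-joint setup are all hypothetical, and in DLA the skinning weights would come from the SMPL-X surface to which each UV-space Gaussian is anchored. A fuller version would presumably also rotate each Gaussian's orientation, which this sketch omits.

```python
import numpy as np

def skin_gaussian_means(means, skin_weights, joint_transforms):
    """Pose canonical Gaussian centers with linear blend skinning (LBS).

    Hypothetical layout (an assumption for this sketch):
      means:            (N, 3) canonical Gaussian centers
      skin_weights:     (N, J) per-Gaussian joint weights, rows sum to 1
      joint_transforms: (J, 4, 4) canonical-to-posed rigid transforms
    """
    # Blend the per-joint transforms for every Gaussian: (N, 4, 4).
    blended = np.einsum("nj,jab->nab", skin_weights, joint_transforms)
    # Lift centers to homogeneous coordinates: (N, 4).
    homo = np.concatenate([means, np.ones((means.shape[0], 1))], axis=1)
    # Apply each blended transform to its own center.
    posed = np.einsum("nab,nb->na", blended, homo)
    return posed[:, :3]

# Toy example: joint 0 stays fixed, joint 1 translates by 2 units along y.
transforms = np.stack([np.eye(4), np.eye(4)])
transforms[1, :3, 3] = [0.0, 2.0, 0.0]

means = np.zeros((3, 3))
weights = np.array([[1.0, 0.0],   # bound fully to joint 0
                    [0.5, 0.5],   # shared equally between both joints
                    [0.0, 1.0]])  # bound fully to joint 1

posed = skin_gaussian_means(means, weights, transforms)
```

In this toy setup the three centers move by 0, 1, and 2 units along y respectively, which shows how a Gaussian anchored between two joints follows a weighted mix of their motions.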
