One Shot, One Talk: Whole-body Talking Avatar from a Single Image

Jun Xiang, Yudong Guo, Leipeng Hu, Boyang Guo, Yancheng Yuan, Juyong Zhang

TL;DR

The authors build a photorealistic, animatable whole-body avatar from a single image by tightly coupling a 3D Gaussian Splatting (3DGS) representation with an SMPL-X mesh. They generate imperfect pseudo-labels for body and head motion with pre-trained pose-guided diffusion models, driven by motion sequences from the TED Gesture dataset, and supervise the hybrid 3DGS-mesh avatar with mesh-centric regularizations and perceptual losses to mitigate the inconsistencies in those labels. The result is a one-shot, expressive talking avatar that preserves identity and supports accurate body, hand, and facial animation, outperforming several state-of-the-art methods that rely on video inputs. The work shows strong potential for practical talking-avatar applications while noting limitations in finger-level accuracy and large-view rendering, and it suggests integrating semantic priors and static 3D information in future work to broaden capability.

Abstract

Building realistic and animatable avatars still requires minutes of multi-view or monocular self-rotating videos, and most methods lack precise control over gestures and expressions. To push this boundary, we address the challenge of constructing a whole-body talking avatar from a single image. We propose a novel pipeline that tackles two critical issues: 1) complex dynamic modeling and 2) generalization to novel gestures and expressions. To achieve seamless generalization, we leverage recent pose-guided image-to-video diffusion models to generate imperfect video frames as pseudo-labels. To overcome the dynamic modeling challenge posed by inconsistent and noisy pseudo-videos, we introduce a tightly coupled 3DGS-mesh hybrid avatar representation and apply several key regularizations to mitigate inconsistencies caused by imperfect labels. Extensive experiments on diverse subjects demonstrate that our method enables the creation of a photorealistic, precisely animatable, and expressive whole-body talking avatar from just a single image.
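
As a rough illustration of the supervision scheme described above (per-pixel supervision on the real input image, perceptual supervision on the imperfect pseudo-labels, and mesh-related constraints), the following Python sketch combines three such terms into one objective. It is not the authors' implementation: the function names, the loss weights, the block-mean "feature extractor" (a stand-in for a learned perceptual network), and the specific form of the mesh regularizer are all assumptions made for illustration.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Point = Sequence[float]


def l1_loss(rendered: Vector, target: Vector) -> float:
    """Per-pixel term: mean absolute difference (for the real input image)."""
    return sum(abs(r - t) for r, t in zip(rendered, target)) / len(target)


def block_mean_features(pixels: Vector, block: int = 2) -> list:
    """Hypothetical stand-in for a learned feature extractor: block averages."""
    feats = []
    for i in range(0, len(pixels), block):
        chunk = pixels[i:i + block]
        feats.append(sum(chunk) / len(chunk))
    return feats


def perceptual_loss(rendered: Vector, pseudo: Vector,
                    features: Callable[[Vector], Vector]) -> float:
    """Feature-space term: mean squared difference between feature vectors."""
    fr, fp = features(rendered), features(pseudo)
    return sum((a - b) ** 2 for a, b in zip(fr, fp)) / len(fp)


def mesh_regularizer(gaussian_centers: Sequence[Point],
                     mesh_anchors: Sequence[Point]) -> float:
    """Assumed constraint: mean squared distance of each Gaussian to its anchor."""
    total = 0.0
    for g, m in zip(gaussian_centers, mesh_anchors):
        total += sum((gi - mi) ** 2 for gi, mi in zip(g, m))
    return total / len(mesh_anchors)


def total_loss(render_input: Vector, input_image: Vector,
               render_pseudo: Vector, pseudo_frame: Vector,
               gaussian_centers: Sequence[Point],
               mesh_anchors: Sequence[Point],
               w_pixel: float = 1.0, w_perc: float = 0.1,
               w_mesh: float = 0.01) -> float:
    """Weighted sum of the three terms; the weights are illustrative only."""
    return (w_pixel * l1_loss(render_input, input_image)
            + w_perc * perceptual_loss(render_pseudo, pseudo_frame,
                                       block_mean_features)
            + w_mesh * mesh_regularizer(gaussian_centers, mesh_anchors))
```

The design point the sketch tries to capture is that the exact per-pixel term is reserved for the one trustworthy image, while the noisy pseudo-frames are compared only in a feature space that tolerates small local inconsistencies. In this toy version, two frames whose pixels differ but whose block averages agree incur no perceptual penalty.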

Paper Structure

This paper contains 15 sections, 10 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Given a one-shot image (e.g., your favorite photo) as input, our method reconstructs a fully expressive whole-body talking avatar that captures personalized details and supports realistic animation, including vivid body gestures and natural expression changes. Project page: https://ustc3dv.github.io/OneShotOneTalk/
  • Figure 2: Overview. Our method constructs an expressive whole-body talking avatar from a single image. We begin by generating pseudo body and head frames using pre-trained generative models, driven by a collected video dataset with diverse poses. Per-pixel supervision on the input image, perceptual supervision on imperfect pseudo labels, and mesh-related constraints are then applied to guide the 3DGS-mesh coupled avatar representation, ensuring realistic and expressive avatar reconstruction and animation.
  • Figure 3: Qualitative comparisons with representative methods [mimicmotion2024, moon2024exavatar, huang2022elicit] in the cross-identity motion reenactment task. Our method achieves accurate and realistic animation with almost all fine details preserved and identity unchanged.
  • Figure 4: Qualitative comparisons with representative methods [mimicmotion2024, moon2024exavatar, huang2022elicit, huang2024make] in the self-driven motion reenactment task. Our method well models facial and hand regions, which match the input image most in global identity preservation and local details modeling, even compared with some methods trained on captured videos.
  • Figure 5: More examples of cross-identity pose reenactment. Different subjects can be accurately animated with the same poses.
  • ...and 6 more figures