InsTaG: Learning Personalized 3D Talking Head from Few-Second Video

Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Jun Zhou, Lin Gu

TL;DR

InsTaG tackles the data-hungry nature of radiance-field 3D talking head synthesis by decoupling universal motion priors from identity-specific styles. It introduces Identity-Free Pre-training to build a Universal Motion Field and Personalized Fields, and Motion-Aligned Adaptation to quickly tailor unseen identities using a Motion Aligner and Face-Mouth Hook, all within a lightweight 3D Gaussian Splatting framework. The approach achieves high-fidelity, lip-synced personalized heads from as little as five seconds of video with real-time inference, outperforming state-of-the-art methods in both quality and efficiency across diverse identities. This work enables rapid, scalable production of personalized 3D talking heads while addressing safety and ethical considerations of synthetic media.

Abstract

Despite their impressive performance in synthesizing lifelike personalized 3D talking heads, prevailing methods based on radiance fields demand large amounts of training data and time for each new identity. This paper introduces InsTaG, a 3D talking head synthesis framework that enables fast learning of a realistic personalized 3D talking head from little training data. Built upon a lightweight 3D Gaussian Splatting (3DGS) person-specific synthesizer with universal motion priors, InsTaG achieves high-quality and fast adaptation while preserving a high level of personalization and efficiency. As preparation, we first propose an Identity-Free Pre-training strategy that enables pre-training of the person-specific model and encourages the collection of universal motion priors from a long-video data corpus. To fully exploit the universal motion priors when learning an unseen identity, we then present a Motion-Aligned Adaptation strategy that adaptively aligns the target head to the pre-trained field and constrains a robust dynamic head structure when training data is scarce. Experiments demonstrate outstanding performance and efficiency in rendering high-quality personalized talking heads under various data scenarios.

Paper Structure

This paper contains 35 sections, 14 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: With only 5-second video data, InsTaG outperforms state-of-the-art methods [ye2023geneface, li2024talkinggaussian, ye2024mimictalk] by delivering high-quality personalized lip synchronization and realistic rendering with the fastest adaptation, while attaining low memory overhead and real-time inference.
  • Figure 2: Overview of InsTaG. In preparation, InsTaG collects common knowledge of talking motion from a long-video corpus through Identity-Free Pre-training, storing it as a motion field. Given a short video of a new identity, the Motion-Aligned Adaptation strategy builds a robust and fast person-specific synthesizer on the pre-trained motion field to learn a high-quality personalized 3D talking head.
  • Figure 3: Illustration of Face-Mouth Hook. We hook the motion of the mouth to the generated face motion, allowing alignment across the two branches to enhance robustness when training data is scarce.
  • Figure 4: Qualitative Comparison on Synchronization. Our method performs best in both lip synchronization and visual quality. "Real3DP" and "TG" denote Real3DPortrait [ye2024real3d] and TalkingGaussian [li2024talkinggaussian]. Better visualized with zoom-in. We recommend watching the supplementary video.
  • Figure 5: Qualitative Comparison on Reconstruction Quality. Our method performs best in rendering photorealistic talking heads with fine details. "Real3DP" and "TG" denote Real3DPortrait [ye2024real3d] and TalkingGaussian [li2024talkinggaussian]. Better visualized with zoom-in.
  • ...and 7 more figures