Avatar4D: Synthesizing Domain-Specific 4D Humans for Real-World Pose Estimation

Jerrin Bright, Zhibo Wang, Dmytro Klepachevskyi, Yuhao Chen, Sirisha Rambhatla, David Clausi, John Zelek

TL;DR

The paper addresses the need for domain-specific, labeled 4D human motion data to improve pose estimation. It introduces Avatar4D, a three-stage pipeline that generates controllable, photorealistic 4D humans and a large synthetic sports dataset, Syn2Sport. Through extensive experiments, it demonstrates strong supervised performance, zero-shot transfer to real data, and cross-sport generalization, supported by feature-space alignment analyses. The work shows synthetic data can reduce reliance on real annotations while enabling scalable, transferable human motion modeling for domain-specific tasks.

Abstract

We present Avatar4D, a real-world transferable pipeline for generating customizable synthetic human motion datasets tailored to domain-specific applications. Unlike prior works, which focus on general, everyday motions and offer limited flexibility, our approach provides fine-grained control over body pose, appearance, camera viewpoint, and environmental context, without requiring any manual annotations. To validate the impact of Avatar4D, we focus on sports, where domain-specific human actions and movement patterns pose unique challenges for motion understanding. In this setting, we introduce Syn2Sport, a large-scale synthetic dataset spanning sports such as baseball and ice hockey. Avatar4D features high-fidelity 4D (3D geometry over time) human motion sequences with varying player appearances rendered in diverse environments. We benchmark several state-of-the-art pose estimation models on Syn2Sport and demonstrate their effectiveness for supervised learning, zero-shot transfer to real-world data, and generalization across sports. Furthermore, we evaluate how closely the generated synthetic data aligns with real-world datasets in feature space. Our results highlight the potential of such systems to generate scalable, controllable, and transferable human datasets for diverse domain-specific tasks without relying on domain-specific real data.

Paper Structure

This paper contains 18 sections, 3 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Overview of Avatar4D architecture. We sample a random source image ($\text{S}_\text{img}$), source background ($\text{S}_\text{bg}$) and motion sequence ($\text{S}_\text{motion}$) from three dictionaries: $\mathcal{D}_\text{img}$ for domain-relevant person images, $\mathcal{D}_\text{bg}$ for scene/background images, and $\mathcal{D}_\text{motion}$ for motion sequences. These inputs are then combined to generate domain-specific synthetic 4D human data.
  • Figure 2: Architecture of Avatar4D. The proposed pipeline consists of three key stages. First, a motion sequence is constructed from expert demonstrations collected from online sources, producing 3D poses along with corresponding camera parameters. Next, canonical 3D human assets are generated from sampled source person images. Finally, in the human-scene composition stage, these assets are deformed according to the 3D poses and rendered against a sampled source background.
  • Figure 3: Sample images from the Syn2Sport dataset. All samples were generated using the proposed Avatar4D pipeline.
  • Figure 4: t-SNE visualization of feature embeddings. Real samples (red) and synthetic samples (blue) show substantial overlap, indicating strong similarity between the datasets.
  • Figure 5: Background generation using different methods. Samples generated using ICLight, ControlNet, and static backgrounds, illustrating their impact on the realism of the synthetic human data.
  • ...and 2 more figures
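
The sampling step described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes each dictionary ($\mathcal{D}_\text{img}$, $\mathcal{D}_\text{bg}$, $\mathcal{D}_\text{motion}$) is a plain list of identifiers and that the three draws are uniform and independent; the paper summary does not state the sampling distribution. The names `SourceTriplet` and `sample_triplet`, and all file and sequence names, are hypothetical and not taken from the authors' code.

```python
import random
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SourceTriplet:
    """One (S_img, S_bg, S_motion) draw, following the Figure 1 caption."""
    s_img: str      # identifier of a domain-relevant person image
    s_bg: str       # identifier of a scene/background image
    s_motion: str   # identifier of a motion sequence


def sample_triplet(
    d_img: Sequence[str],
    d_bg: Sequence[str],
    d_motion: Sequence[str],
    rng: random.Random,
) -> SourceTriplet:
    """Draw one entry from each dictionary.

    Assumption: uniform, independent sampling. The source only says the
    inputs are sampled randomly from the three dictionaries.
    """
    if not (d_img and d_bg and d_motion):
        raise ValueError("each dictionary must contain at least one entry")
    return SourceTriplet(
        s_img=rng.choice(d_img),
        s_bg=rng.choice(d_bg),
        s_motion=rng.choice(d_motion),
    )


if __name__ == "__main__":
    # Hypothetical dictionary contents, for illustration only.
    d_img = ["player_a.png", "player_b.png"]
    d_bg = ["ballpark.jpg", "rink.jpg"]
    d_motion = ["pitch_seq_01", "slapshot_seq_02"]
    print(sample_triplet(d_img, d_bg, d_motion, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the draws reproducible, which helps when a generated sample needs to be traced back to its source inputs.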