TEDRA: Text-based Editing of Dynamic and Photoreal Actors

Basavaraj Sunagad, Heming Zhu, Mohit Mendiratta, Adam Kortylewski, Christian Theobalt, Marc Habermann

TL;DR

TEDRA addresses the challenge of text-guided edits for dynamic, photoreal 3D human avatars by integrating a TriHuman-based drivable representation with a personalized diffusion model. The core contributions are Personalized Normal-Aligned Score Distillation Sampling (PNA-SDS) and windowed timestep annealing, enabling pose- and view-consistent edits that preserve identity and wrinkles while following textual prompts. Extensive experiments on multi-view data show superior 3D consistency, visual fidelity, and temporal coherence compared with state-of-the-art methods, supported by a user study and quantitative metrics like CLIP similarity and FID. The approach empowers intuitive, high-fidelity avatar editing for applications in AR/VR, film, and synthetic data generation, albeit with compute-intensive training and multi-view data requirements that motivate future work toward efficiency and monocular setups.

Abstract

Over the past years, significant progress has been made in creating photorealistic and drivable 3D avatars solely from videos of real humans. However, a core remaining challenge is the fine-grained and user-friendly editing of clothing styles by means of textual descriptions. To this end, we present TEDRA, the first method allowing text-based edits of an avatar, which maintains the avatar's high fidelity, space-time coherency, as well as dynamics, and enables skeletal pose and view control. We begin by training a model to create a controllable and high-fidelity digital replica of the real actor. Next, we personalize a pretrained generative diffusion model by fine-tuning it on various frames of the real character captured from different camera angles, ensuring the digital representation faithfully captures the dynamics and movements of the real person. This two-stage process lays the foundation for our approach to dynamic human avatar editing. Utilizing this personalized diffusion model, we modify the dynamic avatar based on a provided text prompt using our Personalized Normal Aligned Score Distillation Sampling (PNA-SDS) within a model-based guidance framework. Additionally, we propose a time step annealing strategy to ensure high-quality edits. Our results demonstrate a clear improvement over prior work in functionality and visual quality.

Paper Structure

This paper contains 26 sections, 8 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: We propose a method for text-based editing of dynamic and photoreal actors (TEDRA). Our approach edits a pre-trained neural 3D human avatar according to a user-defined text prompt. Importantly, we preserve the original dynamics and view consistency of the digital avatar while also satisfying the desired edit.
  • Figure 1: More ablative results on the ControlNet conditioning scale and the windowed root annealing method. Please zoom in to see the details.
  • Figure 2: An overview of our approach for text-driven editing of dynamic and photoreal avatars (TEDRA). Our approach starts with a pre-trained TriHuman model as the base human representation. Then, we leverage a fine-tuned diffusion model in conjunction with our proposed Personalized Normal Aligned Score Distillation Sampling (PNA-SDS). The PNA-SDS technique then computes a normal aligned model-based score distillation sampling loss to optimize the human representation towards the edit prompt while preserving the subject's characteristics. This process is further enhanced by incorporating an annealing mechanism, which gradually refines the editing process.
  • Figure 2: The annealing of timesteps under the proposed window-root timestep annealing strategy over 10k iterations. The timestep $t$ is randomly sampled within the shown window. As per Eq. 7, if $t > k$, the scores from both the pre-trained LDM and the personalized LDM are used; otherwise, only the scores from the pre-trained LDM are used.
  • Figure 3: Qualitative Results. We present the text-based visual editing results and the underlying geometry. Our method generates compelling text-driven visual edits, ensuring 3D and temporal consistency while altering appearance and geometry. We recommend that readers zoom in to view the details.
  • ...and 11 more figures
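
The timestep-annealing caption above (window-root annealing, with $t$ sampled inside a moving window and a threshold $k$ deciding which diffusion scores contribute) can be illustrated with a small sketch. This is not the authors' implementation: the exact schedule is not given in this summary, so the square-root decay of the window's upper edge, the window width, the bounds `t_max` and `t_min`, and all function names below are assumptions chosen only for illustration. The one rule taken from the caption is the threshold test: both the pre-trained and the personalized LDM are used when $t > k$, and only the pre-trained LDM otherwise.

```python
import math
import random


def window_bounds(step, total_steps, t_max=980, t_min=20, window=200):
    """Return (lower, upper) timestep bounds for the current iteration.

    ASSUMPTION: the upper edge of the window decays from t_max down to
    t_min + window along a square-root curve of training progress. The
    paper's actual schedule may differ.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    upper = t_max - (t_max - (t_min + window)) * math.sqrt(progress)
    upper = int(round(upper))
    lower = max(t_min, upper - window)
    return lower, upper


def sample_timestep(step, total_steps, rng=random, **schedule_kwargs):
    """Sample t uniformly (inclusive) inside the current window."""
    lower, upper = window_bounds(step, total_steps, **schedule_kwargs)
    return rng.randint(lower, upper)


def guidance_sources(t, k):
    """Select which diffusion models contribute scores at timestep t.

    Follows the caption: t > k uses both models, otherwise only the
    pre-trained one. The string labels are hypothetical placeholders.
    """
    if t > k:
        return ("pretrained", "personalized")
    return ("pretrained",)


if __name__ == "__main__":
    total = 10_000  # the caption mentions 10k iterations
    k = 400         # hypothetical threshold value
    for step in (0, 2_500, 10_000):
        t = sample_timestep(step, total)
        print(step, window_bounds(step, total), t, guidance_sources(t, k))
```

Under these assumed defaults, the window starts at (780, 980) and ends at (20, 220), so early iterations draw high-noise timesteps and later iterations draw low-noise ones, which is the general behaviour the caption describes.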