Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling

Shuhong Zheng, Ashkan Mirzaei, Igor Gilitschenski

TL;DR

The paper addresses the challenge of preserving subject identity in 3D/4D generation by introducing TIRE (Track, Inpaint, Resplat), a three-stage pipeline. Starting from a rough 3D asset, TIRE uses backward video tracking to identify the regions that need infilling, progressively inpaints those regions with a subject-driven 2D diffusion model, and finally resplats the infilled textures back to 3D with cross-view consistency. By leveraging 2D tracking and inpainting tools, TIRE improves identity preservation and cross-view coherence over state-of-the-art baselines, and it can be used as a plug-in for diverse 3D/4D representations. Its technical contributions include backward tracking for accurate masks, LoRA-finetuned inpainting for subject specificity, and mask-aware latent diffusion refinements during resplatting. Evaluations on a DreamBooth-Dynamic dataset and in-the-wild data, using DINO-based and VLM-based metrics plus human user studies, show significant gains in subject fidelity and geometry quality. The paper also acknowledges limitations in current quantitative evaluation and runtime efficiency.

Abstract

Current 3D/4D generation methods are usually optimized for photorealism, efficiency, and aesthetics. However, they often fail to preserve the semantic identity of the subject across different viewpoints. Adapting generation methods with one or a few images of a specific subject (also known as personalization or subject-driven generation) allows generating visual content that aligns with the identity of the subject. However, personalized 3D/4D generation is still largely underexplored. In this work, we introduce TIRE (Track, Inpaint, REsplat), a novel method for subject-driven 3D/4D generation. It takes an initial 3D asset produced by an existing 3D generative model as input and uses video tracking to identify the regions that need to be modified. Then, we adopt a subject-driven 2D inpainting model to progressively infill the identified regions. Finally, we resplat the modified 2D multi-view observations back to 3D while maintaining consistency. Extensive experiments demonstrate that our approach significantly improves identity preservation in 3D/4D generation compared to state-of-the-art methods. Our project website is available at https://zsh2000.github.io/track-inpaint-resplat.github.io/.
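
The three stages described in the abstract can be read as a simple control flow. The sketch below is only an illustration of that ordering (track, then inpaint, then resplat) and is not the authors' implementation. The function name `tire_sketch`, its parameters, and the toy stand-ins `hole_mask` and `fill` are all hypothetical; in particular, the tracker, the subject-driven inpainting model, and the 3D reconstruction step are replaced by trivial placeholders, and the view processing order is an assumption.

```python
from typing import Callable, List, Sequence

import numpy as np

Image = np.ndarray  # H x W x 3 float array
Mask = np.ndarray   # H x W bool array, True = pixel needs infilling


def tire_sketch(
    views: Sequence[Image],
    source_idx: int,
    track: Callable[[Image, Image], Mask],
    inpaint: Callable[[Image, Mask], Image],
    resplat: Callable[[List[Image]], object],
    order: Sequence[int],
):
    """Hypothetical control flow only: Track -> Inpaint -> Resplat."""
    updated = [v.copy() for v in views]  # never modify the caller's renders
    for i in order:
        if i == source_idx:
            continue  # the source view is kept as given
        mask = track(updated[source_idx], updated[i])  # Track: find regions to infill
        updated[i] = inpaint(updated[i], mask)         # Inpaint: fill those regions
    return resplat(updated)                            # Resplat: lift views back to 3D


# Toy stand-ins (assumptions for the demo, not real models):
def hole_mask(src: Image, tgt: Image) -> Mask:
    # Pretend that all-zero pixels in the target are the regions to infill.
    return (tgt == 0).all(axis=-1)


def fill(img: Image, mask: Mask) -> Image:
    # Pretend inpainting: write a constant colour into the masked pixels.
    out = img.copy()
    out[mask] = 1.0
    return out


views = [np.ones((4, 4, 3)) for _ in range(3)]
views[1][1:3, 1:3] = 0.0  # a "hole" in view 1
views[2][0, :] = 0.0      # a "hole" in view 2

# np.stack stands in for the resplat step and just collects the views.
asset = tire_sketch(views, 0, hole_mask, fill, np.stack, order=[1, 2])
print(asset.shape)
```

Passing the three stages as callables keeps the sketch independent of any particular tracker, diffusion model, or 3D representation, which matches the plug-in framing of the method without committing to its details.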

Paper Structure

This paper contains 30 sections, 8 equations, 22 figures, 4 tables.

Figures (22)

  • Figure 1: State-of-the-art 4D generation model L4GM [l4gm_nocode] and our solution. (a) L4GM performs video-to-4D generation. The side and back views of the generated 4D asset do not resemble the subject in the given source view. (b) Our proposed solution TIRE (Track, Inpaint, REsplat) adopts a progressive texture infilling paradigm to inpaint the 3D asset for subject-driven 3D/4D generation, preserving the identity of the generated asset when it is observed from novel views.
  • Figure 2: Pipeline of TIRE. TIRE starts from a rough 3D asset created by existing models, together with its rendered multi-view observations. The three stages, Track, Inpaint, and Resplat, then identify the inpainting masks, infill the occluded regions, and unproject the results back to 3D, respectively.
  • Figure 3: Comparison between forward tracking and backward tracking when identifying the inpainting mask. Forward tracking, which runs from the given source view to the target views, is more intuitive but leads to grainy inpainting results. In contrast, backward tracking produces more accurate, better-shaped masks, which benefits the subsequent inpainting process.
  • Figure 4: Qualitative comparison on image-to-3D generation with SV3D [sv3d], Wonder3D [long2024wonder3dsingleimage3d], LGM [lgm_hascode], MeshFormer [liu2024meshformer], TRELLIS [xiang2025structured3dlatentsscalable], and Hunyuan3D-v2.5 [lai2025hunyuan3d25highfidelity3d]. Compared with other methods in the image-to-3D setting, our method better preserves the identity of the reference image and also achieves superior geometry quality. Notably, even for the most recent image-to-3D advances such as TRELLIS and Hunyuan3D-v2.5, the challenge of producing identity-preserving 3D assets is still not well solved.
  • Figure 5: Comparison between our method and the baselines Customize-It-3D [huang2024customizeit3dhighquality3dcreation] (with an additional feed-forward L4GM pass applied after obtaining multi-view observations, so that it can generate dynamic 3D assets), STAG4D [zeng2024stag4d_hascode], and L4GM [l4gm_nocode].
  • ...and 17 more figures
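
The Figure 3 caption contrasts forward and backward tracking for building inpainting masks. One plausible intuition for the "grainy" forward result, offered here as an assumption rather than something stated in the source, is that tracks started in the source view land only sparsely in the target view, whereas tracks started at every target pixel yield a dense decision per pixel. The toy below replaces a real video tracker with a known analytic motion (a 2x magnification); `SCALE`, `forward_mask`, and `backward_mask` are hypothetical names used only for this illustration.

```python
import numpy as np

SCALE = 2  # toy motion (assumption): target coordinates = SCALE * source coordinates


def forward_mask(src_visible: np.ndarray, tgt_shape: tuple) -> np.ndarray:
    """Start at source pixels, follow them into the target, mask what is never hit."""
    covered = np.zeros(tgt_shape, dtype=bool)
    ys, xs = np.nonzero(src_visible)          # source pixels with known texture
    covered[ys * SCALE, xs * SCALE] = True    # where each of them lands in the target
    return ~covered                           # True = needs inpainting


def backward_mask(src_visible: np.ndarray, tgt_shape: tuple) -> np.ndarray:
    """Start at every target pixel, follow it back, mask it if the source lacks it."""
    ty, tx = np.indices(tgt_shape)
    return ~src_visible[ty // SCALE, tx // SCALE]


# 8x8 source in which only the left half carries known texture.
src_visible = np.zeros((8, 8), dtype=bool)
src_visible[:, :4] = True

fwd = forward_mask(src_visible, (16, 16))
bwd = backward_mask(src_visible, (16, 16))

# Forward: speckled mask that also flags most of the known left half.
# Backward: a clean mask covering exactly the unknown right half.
print(fwd.sum(), bwd.sum())  # 224 128
```

In this toy, the forward mask flags 96 of the 128 pixels in the known left half of the target as holes, while the backward mask flags none of them. A real tracker returns trajectories and visibility estimates rather than an exact mapping, so the actual behavior in the paper may differ from this simplified picture.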