Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling
Shuhong Zheng, Ashkan Mirzaei, Igor Gilitschenski
TL;DR
The paper addresses the challenge of preserving subject identity in 3D/4D generation by introducing TIRE (Track, Inpaint, Resplat), a three-stage pipeline that starts from a rough 3D asset, uses backward video tracking to identify infill regions, progressively inpaints those regions with a subject-driven 2D diffusion model, and finally resplats the infilled textures back to 3D with cross-view consistency. By leveraging 2D tracking and inpainting tools, TIRE achieves improved identity preservation and cross-view coherence compared with state-of-the-art baselines, and demonstrates applicability as a plug-in to diverse 3D/4D representations. The approach includes technical innovations such as backward tracking for accurate masks, LoRA-finetuned inpainting for subject specificity, and mask-aware latent diffusion refinements during resplat. Evaluations on a DreamBooth-Dynamic dataset and on in-the-wild data, using DINO-based and VLM-based metrics plus human user studies, show significant gains in subject fidelity and geometry quality. The paper also acknowledges limitations in current quantitative evaluation and in runtime efficiency.
Abstract
Current 3D/4D generation methods are usually optimized for photorealism, efficiency, and aesthetics. However, they often fail to preserve the semantic identity of the subject across different viewpoints. Adapting generation methods with one or a few images of a specific subject (also known as Personalization or Subject-driven generation) allows generating visual content that aligns with the identity of the subject. However, personalized 3D/4D generation is still largely underexplored. In this work, we introduce TIRE (Track, Inpaint, REsplat), a novel method for subject-driven 3D/4D generation. It takes an initial 3D asset produced by an existing 3D generative model as input and uses video tracking to identify the regions that need to be modified. Then, we adopt a subject-driven 2D inpainting model for progressively infilling the identified regions. Finally, we resplat the modified 2D multi-view observations back to 3D while still maintaining consistency. Extensive experiments demonstrate that our approach significantly improves identity preservation in 3D/4D generation compared to state-of-the-art methods. Our project website is available at https://zsh2000.github.io/track-inpaint-resplat.github.io/.
