ShapeUP: Scalable Image-Conditioned 3D Editing
Inbar Gat, Dana Cohen-Bar, Guy Levy, Elad Richardson, Daniel Cohen-Or
TL;DR
ShapeUP edits 3D assets while preserving their identity and consistency by reframing editing as supervised latent-to-latent translation inside a native 3D diffusion backbone. It decouples geometry and texture into a two-stage pipeline conditioned on the source shape and an edited image, uses LoRA-based conditioning, and trains on data that includes Distant Frames in Motion (DFM) to enable global edits. A new synthetic dataset and a dedicated benchmark support evaluation of both global and local edits, and ablations show that 1024 latent tokens and DFM data benefit identity preservation and pose changes. ShapeUP achieves higher edit fidelity and better preservation of occluded regions than training-free and learned baselines, offering a scalable path to native 3D content editing without optimization or explicit masks.
Abstract
Recent advancements in 3D foundation models have enabled the generation of high-fidelity assets, yet precise 3D manipulation remains a significant challenge. Existing 3D editing frameworks often face a difficult trade-off between visual controllability, geometric consistency, and scalability. Specifically, optimization-based methods are prohibitively slow, multi-view 2D propagation techniques suffer from visual drift, and training-free latent manipulation methods are inherently bound by frozen priors and cannot directly benefit from scaling. In this work, we present ShapeUP, a scalable, image-conditioned 3D editing framework that formulates editing as a supervised latent-to-latent translation within a native 3D representation. This formulation allows ShapeUP to build on a pretrained 3D foundation model, leveraging its strong generative prior while adapting it to editing through supervised training. In practice, ShapeUP is trained on triplets consisting of a source 3D shape, an edited 2D image, and the corresponding edited 3D shape, and learns a direct mapping using a 3D Diffusion Transformer (DiT). This image-as-prompt approach enables fine-grained visual control over both local and global edits and achieves implicit, mask-free localization, while maintaining strict structural consistency with the original asset. Our extensive evaluations demonstrate that ShapeUP consistently outperforms current trained and training-free baselines in both identity preservation and edit fidelity, offering a robust and scalable paradigm for native 3D content creation.
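To make the triplet formulation concrete, the following is a minimal sketch of one supervised training step, not the authors' implementation. The abstract does not state the training objective, so a generic flow-matching-style velocity loss is assumed here. `EditTriplet`, `toy_denoiser`, and `velocity_loss` are hypothetical names, and the linear `toy_denoiser` is only a stand-in for the 3D DiT.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class EditTriplet:
    """One training example: (source shape, edited image, edited shape)."""
    source_latent: np.ndarray  # (tokens, dim) latent of the source 3D shape
    edited_image: np.ndarray   # (img_tokens, dim) features of the edited 2D image
    target_latent: np.ndarray  # (tokens, dim) latent of the edited 3D shape


def toy_denoiser(noisy, t, source, image, weights):
    # Hypothetical stand-in for the 3D DiT: a linear map over the noisy
    # latent plus a pooled conditioning signal (source shape + edited image).
    cond = source + image.mean(axis=0, keepdims=True)
    return noisy @ weights["w_x"] + cond @ weights["w_c"] + t * weights["b_t"]


def velocity_loss(triplet, weights, rng):
    # Assumed objective (not given in the source): interpolate between
    # Gaussian noise and the target latent, then regress the velocity.
    t = rng.uniform()
    noise = rng.standard_normal(triplet.target_latent.shape)
    noisy = (1.0 - t) * noise + t * triplet.target_latent
    velocity_target = triplet.target_latent - noise
    pred = toy_denoiser(noisy, t, triplet.source_latent,
                        triplet.edited_image, weights)
    return float(np.mean((pred - velocity_target) ** 2))


def make_toy_inputs(tokens=16, img_tokens=4, dim=8, seed=0):
    # Random placeholders; real latents would come from a pretrained 3D encoder.
    rng = np.random.default_rng(seed)
    triplet = EditTriplet(
        source_latent=rng.standard_normal((tokens, dim)),
        edited_image=rng.standard_normal((img_tokens, dim)),
        target_latent=rng.standard_normal((tokens, dim)),
    )
    weights = {
        "w_x": np.zeros((dim, dim)),
        "w_c": np.zeros((dim, dim)),
        "b_t": np.zeros(dim),
    }
    return triplet, weights
```

The point of the sketch is the data flow: the network sees a noised version of the edited shape's latent together with both conditions, and the edited shape's latent supplies the supervision, so no mask or per-asset optimization is involved.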
