ShapeUP: Scalable Image-Conditioned 3D Editing

Inbar Gat; Dana Cohen-Bar; Guy Levy; Elad Richardson; Daniel Cohen-Or

TL;DR

ShapeUP tackles the challenge of editing 3D assets while preserving identity and consistency by reframing edits as supervised latent-to-latent translations inside a native 3D diffusion backbone. It decouples geometry and texture into a two-stage pipeline conditioned on the source shape and an edited image, using LoRA-based conditioning and a data-driven approach that incorporates Distant Frames in Motion (DFM) to enable global edits. A new synthetic dataset and a dedicated benchmark enable evaluation of global and local edits, with ablations showing the benefits of 1024 latent tokens and DFM data for identity preservation and pose changes. The results indicate ShapeUP achieves superior edit fidelity and occluded-region preservation compared to training-free and learned baselines, offering a scalable path for native 3D content editing without optimization or explicit masks.

Abstract

Recent advancements in 3D foundation models have enabled the generation of high-fidelity assets, yet precise 3D manipulation remains a significant challenge. Existing 3D editing frameworks often face a difficult trade-off between visual controllability, geometric consistency, and scalability. Specifically, optimization-based methods are prohibitively slow, multi-view 2D propagation techniques suffer from visual drift, and training-free latent manipulation methods are inherently bound by frozen priors and cannot directly benefit from scaling. In this work, we present ShapeUP, a scalable, image-conditioned 3D editing framework that formulates editing as a supervised latent-to-latent translation within a native 3D representation. This formulation allows ShapeUP to build on a pretrained 3D foundation model, leveraging its strong generative prior while adapting it to editing through supervised training. In practice, ShapeUP is trained on triplets consisting of a source 3D shape, an edited 2D image, and the corresponding edited 3D shape, and learns a direct mapping using a 3D Diffusion Transformer (DiT). This image-as-prompt approach enables fine-grained visual control over both local and global edits and achieves implicit, mask-free localization, while maintaining strict structural consistency with the original asset. Our extensive evaluations demonstrate that ShapeUP consistently outperforms current trained and training-free baselines in both identity preservation and edit fidelity, offering a robust and scalable paradigm for native 3D content creation.
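The training triplets described in the abstract can be pictured as a small record type. The following is a minimal sketch, not the authors' data format: the class name `EditTriplet`, the field names, the latent width of 64, and the image size are all hypothetical. Only the triplet structure (source 3D shape, edited 2D image, edited 3D shape) and the figure of 1024 latent tokens come from the text above.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class EditTriplet:
    """One supervised example: (source shape, edited image, edited shape).

    All field names and shapes are illustrative assumptions.
    """
    source_latents: np.ndarray  # (num_tokens, latent_dim) latent code of the source 3D shape
    edited_image: np.ndarray    # (height, width, 3) the edited 2D image used as the prompt
    target_latents: np.ndarray  # (num_tokens, latent_dim) latent code of the edited 3D shape


def make_dummy_triplet(num_tokens=1024, latent_dim=64, image_size=32, seed=0):
    """Build a random triplet with the assumed shapes, for shape-checking only."""
    rng = np.random.default_rng(seed)
    return EditTriplet(
        source_latents=rng.standard_normal((num_tokens, latent_dim)),
        edited_image=rng.random((image_size, image_size, 3)),
        target_latents=rng.standard_normal((num_tokens, latent_dim)),
    )


def is_latent_to_latent(triplet):
    """Source and target must live in the same latent space (same shape)."""
    return triplet.source_latents.shape == triplet.target_latents.shape


triplet = make_dummy_triplet()
print(triplet.source_latents.shape, triplet.edited_image.shape, is_latent_to_latent(triplet))
```

Keeping source and target in the same latent space is what makes the task a latent-to-latent translation: the model maps one latent code to another, with the edited image as the guide.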

Paper Structure (50 sections, 2 equations, 12 figures, 2 tables)

Introduction
Related Work
3D Diffusion Models
3D Editing via 2D Lifting
Optimization-based methods.
Multi-view propagation methods.
Native 3D Editing
Training-free methods.
Learned methods.
Method
Geometry Editing
Geometry Generation Backbone
Geometry Editing Pipeline
Texture Editing
Texture Generation Backbone
...and 35 more sections

Figures (12)

Figure 1: Overview. ShapeUP takes a Textured Source Mesh together with a single Edited Image (left). The ShapeUP Geometry module produces an Untextured Edited Mesh by editing the source shape directly in a native 3D latent space, preserving identity and enabling implicit localization. The edited geometry is rendered to obtain Positions + Normals, which guide the ShapeUP Texture module (right) to generate the final Textured Edited Mesh while retaining details from the Source Texture.
Figure 2: ShapeUP Geometry Editing. During inference, the source shape $S_{src}$ is encoded by the Shape Encoder, from which $K$ latent vectors are sampled to form the source geometry conditioning signal. The edited image $I_{edit}$ is encoded by the Image Encoder to provide the target edit conditioning signal. Both signals are concatenated and processed by $N$ identical layers comprising Double Stream, Single Stream, and MLP blocks, with LoRA trained on the Double and Single Stream blocks. The geometry pipeline produces a latent representation of the edited mesh, which is decoded into 3D by the Shape Decoder to obtain the target shape $S_{edit}$.
Figure 3: ShapeUP Texture Editing. Texture editing is conditioned on the edited image, $I_{edit}$, multi-view renders of the source textured mesh, $i^{MV}_{src}$, and multi-view normal and position renders of the edited shape, $G_{edit}$. Deep features are extracted from both $I_{edit}$ and $i^{MV}_{src}$ and fused through cross-attention, while $G_{edit}$ features are incorporated as additive residuals. The model outputs a set of consistent edited multi-view images, which are subsequently baked onto the edited geometry.
Figure 4: Qualitative Comparisons. We compare our method against 3DEditFormer and EditP23. The first two columns show the source mesh (front and back views), followed by the editing condition. Our method (columns 4-5) aligns better with the editing condition and preserves the object's identity better than the baselines.
Figure 5: User study results. We show the percentage of participants who preferred our method over each baseline. Error bars show 95% confidence intervals.
...and 7 more figures
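The conditioning scheme in the Figure 2 caption (sample $K$ latent vectors from the encoded source shape, concatenate them with the image tokens, and adapt the transformer blocks with LoRA) can be sketched roughly as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation. The function and class names, the uniform sampling without replacement, the shared token width of 64, the token counts, and the LoRA rank and scale are all hypothetical. The LoRA update uses the standard low-rank form of frozen weight plus scaled `B @ A`.

```python
import numpy as np


def sample_source_tokens(shape_latents, k, rng):
    """Pick k latent vectors (rows) from the encoded source shape.

    Uniform sampling without replacement is an assumption.
    """
    idx = rng.choice(shape_latents.shape[0], size=k, replace=False)
    return shape_latents[idx]


def build_condition_sequence(source_tokens, image_tokens):
    """Concatenate geometry and image tokens along the sequence axis.

    Assumes both signals were already projected to the same width.
    """
    return np.concatenate([source_tokens, image_tokens], axis=0)


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (standard LoRA form)."""

    def __init__(self, weight, rank, alpha, rng):
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen pretrained weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
        self.B = np.zeros((d_out, rank))                   # trainable up-projection, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update


rng = np.random.default_rng(0)
shape_latents = rng.standard_normal((4096, 64))  # hypothetical Shape Encoder output
image_tokens = rng.standard_normal((257, 64))    # hypothetical Image Encoder output

source_tokens = sample_source_tokens(shape_latents, k=1024, rng=rng)
sequence = build_condition_sequence(source_tokens, image_tokens)

layer = LoRALinear(rng.standard_normal((64, 64)), rank=8, alpha=16.0, rng=rng)
output = layer(sequence)
print(sequence.shape, output.shape)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, so fine-tuning for editing starts from the pretrained generative prior rather than disturbing it.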
