VecSet-Edit: Unleashing Pre-trained LRM for Mesh Editing from Single Image

Teng-Fang Hsiao, Bo-Kai Ruan, Yu-Lun Liu, Hong-Han Shuai

TL;DR

VecSet-Edit tackles localized mesh editing guided by a single image by leveraging a VecSet Large Reconstruction Model (LRM) as a backbone. It introduces a token-level localization strategy—Mask-guided Token Seeding and Attention-aligned Token Gating—paired with Drift-aware Token Pruning to prevent diffusion-induced artifacts, enabling region-specific edits while preserving the rest of the mesh. A Detail-preserving Texture Baking step ensures high-frequency appearance details remain intact in unedited regions. Empirical results on Edit3D-Bench show superior preservation, sharper condition alignment, and faster performance compared with voxel- and multi-view-based baselines, bridging high-fidelity LRM generation with precise, production-ready mesh editing from a single image.

Abstract

3D editing has emerged as a critical research area for giving users flexible control over 3D assets. While current editing approaches predominantly focus on 3D Gaussian Splatting or multi-view images, the direct editing of 3D meshes remains underexplored. Prior attempts, such as VoxHammer, rely on voxel-based representations that suffer from limited resolution and necessitate labor-intensive 3D masks. To address these limitations, we propose VecSet-Edit, the first pipeline that leverages the high-fidelity VecSet Large Reconstruction Model (LRM) as a backbone for mesh editing. Our approach is grounded in an analysis of the spatial properties of VecSet tokens, which reveals that token subsets govern distinct geometric regions. Based on this insight, we introduce Mask-guided Token Seeding and Attention-aligned Token Gating strategies to precisely localize target regions using only 2D image conditions. In addition, because the VecSet diffusion process differs from its voxel-based counterpart, we design Drift-aware Token Pruning to reject geometric outliers during the denoising process. Finally, our Detail-preserving Texture Baking module ensures that we preserve not only the geometric details of the original mesh but also its textural information. More details can be found on our project page: https://github.com/BlueDyee/VecSet-Edit/tree/main

Paper Structure (42 sections, 29 equations, 13 figures, 3 tables, 1 algorithm)

Figures (13)

  • Figure 1: Overview of the VecSet-Edit framework. Given a mesh $\mathcal{S}$ and a user-edited target view $I_E$ (a), the pipeline proceeds in three stages: (b) VecSet Encoding: $\mathcal{S}$ is encoded into a set of latent tokens $\mathbf{V}$, which serves as the workspace for editing. (c) Token Selection: To localize the editable region without 3D supervision, we analyze the internal attention maps of the LRM. Mask-guided Token Seeding aggregates informative cross-attention layers to identify initial seed tokens $\mathbf{V}_I$ that align with the 2D mask. Attention-aligned Token Gating then leverages self-attention correlations to expand this selection to the full geometric structure, yielding the final editable subset $\mathbf{V}_E$. (d) Edit with Token Pruning: We perform diffusion-based editing on $\mathbf{V}_E$ while constraining the preserved tokens $\mathbf{V}_P$. To prevent geometric artifacts, Drift-aware Token Pruning is applied during denoising to detect and discard "conflict" tokens that drift into the preserved regions but lack support from the editing condition. This ensures that the final output faithfully respects both the target edit and the original structure.
  • Figure 2: Illustration of the VecSet Geometry Property. We validate that the unordered VecSet tokens exhibit spatial locality. Given a mesh $\mathcal{S}$ and a bounding box $\mathcal{B}$, we first identify the index set $\mathbb{I}$ corresponding to query points $\mathbf{P}$ that fall within $\mathcal{B}$. We then extract the token subset $\mathbf{V}_{\mathcal{B}}=\operatorname{Gather}(\mathbf{V}, \mathbb{I})$. Finally, we quantify the reconstruction fidelity by measuring the Chamfer Distance of both the geometry decoded purely from the subset ($\operatorname{Decode}(\mathbf{V}_{\mathcal{B}})$) and the reference geometry cropped from the full reconstruction ($\operatorname{Decode}(\mathbf{V}) \cap \mathcal{B}$) against the cropped source mesh $\mathcal{S}\cap \mathcal{B}$.
  • Figure 3: Illustration of KL divergence in the T2I and VecSet diffusion processes. In T2I models, the layers with higher divergence more strongly correlate the prompt with the object location. A similar pattern appears in the VecSet model, where the tokens with higher KL divergence are more strongly correlated with the image.
  • Figure 4: Illustration of the VecSet RePaint process (same input condition as Figure 1). We visualize a toy example where tokens serve as particles and their movement regions are denoted by circles. Blue and red dots represent the preserved tokens $\mathbf{V}_P$ and the edited tokens $\mathbf{V}_E$, respectively. As illustrated, at $t=0.5T$, the overlap between $\mathbf{V}_P$ and $\mathbf{V}_E$ becomes irreversible due to the contraction of the movement region.
  • Figure 5: Illustration of our proposed Detail-preserving Texture Baking. Relying solely on the standard MV-Adapter leads to visual discrepancies in the preserved regions (highlighted by the red box). In contrast, our Detail-preserving Texture Baking effectively mitigates these errors, maintaining the fidelity of the original unedited areas (highlighted by the green box).
  • ...and 8 more figures
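
The locality check described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`indices_in_box`, `gather`, `chamfer_distance`) are hypothetical, the bounding box is assumed to be axis-aligned, and the Chamfer distance uses the mean nearest-neighbor Euclidean form (other common variants use squared distances or sums). Decoding tokens into geometry requires the pretrained LRM and is omitted here, so plain point lists stand in for decoded surfaces.

```python
import math


def indices_in_box(points, box_min, box_max):
    """Indices of the query points that fall inside an axis-aligned box."""
    return [
        i
        for i, p in enumerate(points)
        if all(lo <= c <= hi for c, lo, hi in zip(p, box_min, box_max))
    ]


def gather(tokens, indices):
    """Select the token subset addressed by an index set."""
    return [tokens[i] for i in indices]


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Assumed variant: mean nearest-neighbor Euclidean distance from a to b
    plus the same quantity from b to a.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    a_to_b = sum(min(math.dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(math.dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


# Toy data: one query point per token (illustrative values only).
query_points = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9), (0.2, 0.3, 0.1), (0.8, 0.1, 0.5)]
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]

idx = indices_in_box(query_points, (0.0, 0.0, 0.0), (0.5, 0.5, 0.5))
subset = gather(tokens, idx)  # tokens whose query points lie inside the box
```

In a real experiment, `subset` would be passed to the LRM decoder, and `chamfer_distance` would compare points sampled from the decoded surface with points sampled from the cropped source mesh. The brute-force nearest-neighbor search is quadratic in the number of points, so a spatial index would be preferable for dense samples.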

Theorems & Definitions (1)

  • Definition 1: VecSet Geometry Property