LEMON: Localized Editing with Mesh Optimization and Neural Shaders

Furkan Mert Algan, Umut Yazgan, Driton Salihu, Cem Eteke, Eckehard Steinbach

TL;DR

LEMON addresses the challenge of editing polygonal meshes from multi-view images under natural language prompts while preserving the original geometry. It fuses neural deferred shading with localized mesh optimization, using vertex-level importance scores and ControlNet-conditioned diffusion to drive text-guided edits, with iterative dataset updates and mesh deformation. Evaluated on the DTU dataset, LEMON demonstrates superior alignment to prompts (via CLIP Directional Similarity) and maintains geometric integrity while delivering faster, more consistent results than several baselines. The approach offers a practical path to precise, localized mesh editing with coordinated appearance changes, though it relies on masking and suggests future inpainting-based extensions to add new geometry.

Abstract

In practical use cases, editing an existing polygonal mesh can be faster than generating a new one, but it can still be challenging and time-consuming for users. Existing solutions to this problem tend to focus on a single task, either geometry or novel view synthesis, which often leads to disjointed results between the mesh and the rendered views. In this work, we propose LEMON, a mesh editing pipeline that combines neural deferred shading with localized mesh optimization. Our approach begins by identifying the vertices of the mesh that matter most for the edit, using a segmentation model to focus on these key regions. Given multi-view images of an object, we optimize a neural shader and a polygonal mesh while extracting the normal map and the rendered image from each view. Using these outputs as conditioning data, we edit the input images with a text-to-image diffusion model and iteratively update our dataset while deforming the mesh. This process yields a polygonal mesh that is edited according to the given text instruction, preserving the geometric characteristics of the initial mesh while focusing on the most significant areas. We evaluate our pipeline on the DTU dataset, demonstrating that it generates finely edited meshes faster than current state-of-the-art methods. We include our code and additional results in the supplementary material.
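
To make the idea of localized mesh optimization concrete, the following is a minimal sketch of a single update step in which only vertices marked as relevant are allowed to move. It is not the authors' implementation: the function name `masked_vertex_step`, the hard `threshold` on the scores, the learning rate, and the plain gradient-descent update are all assumptions made for illustration. The abstract only states that editing focuses on the most important vertices.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def masked_vertex_step(
    vertices: Sequence[Vec3],
    grads: Sequence[Vec3],
    scores: Sequence[float],
    lr: float = 0.1,
    threshold: float = 0.5,
) -> List[Vec3]:
    """Apply one gradient step, but only to vertices whose score passes the threshold.

    `scores` stands in for per-vertex importance scores (hypothetical values in [0, 1]).
    The hard threshold is an assumption; a soft weighting would be equally plausible.
    """
    if not (len(vertices) == len(grads) == len(scores)):
        raise ValueError("vertices, grads, and scores must have the same length")

    updated: List[Vec3] = []
    for v, g, s in zip(vertices, grads, scores):
        if s >= threshold:
            # Relevant vertex: move against the gradient.
            updated.append(tuple(vi - lr * gi for vi, gi in zip(v, g)))
        else:
            # Irrelevant vertex: keep the original position to preserve geometry.
            updated.append(tuple(v))
    return updated


verts = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
grads = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
scores = [0.9, 0.1]  # only the first vertex is considered relevant
moved = masked_vertex_step(verts, grads, scores, lr=0.5)
```

In this toy call, the first vertex shifts along the x-axis while the second stays fixed, which is the behavior that keeps unedited regions of the mesh unchanged.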

Paper Structure (15 sections, 4 equations, 13 figures, 1 table)

Figures (13)

  • Figure 1: We propose LEMON, a polygonal mesh editing method that takes multi-view images and user-provided text instructions as input and edits the mesh while preserving the geometric characteristics of the original mesh. Our method localizes the edit according to the given instruction, changes only the important parts of the mesh, and provides a neural shader for novel views.
  • Figure 2: Pipeline of LEMON: We complement a multi-view mesh reconstruction model with a text-to-image model using localized features. After gathering vertex scores in our pre-processing step, we begin our editing process. Every $d$ iterations, new images are generated by ControlNet based on the prompt. The initial noise of the diffusion model is computed from a weighted sum of input images and rendered images, while generation is conditioned on the rendered normals and images of the mesh. The generated images are masked, and the masked regions are overlaid onto the original images, creating modified versions that are then used to update the dataset. Using vertex scores as a mask on the mesh, we update only the subset of vertices that is relevant to the prompt. By continuously updating the dataset with edited images, we deform the mesh to align with the user's request.
  • Figure 3: Image and vertex scoring process. Using CLIPSeg, we segment the most important parts of the mesh for the given instruction.
  • Figure 4: The effect of the latent weight hyperparameter $\lambda$ on editing the skull object from DTU with the "Turn it into Batman" prompt. The top of the skull has very bright shading, but the prompt requires the object to be darker. When $\lambda$ is set to 0, only the ground truth image is used for the initial noise calculation, so the lines in the skull remain. If $\lambda$ is too high, the rendered image may diverge toward darker tones, leading to unintended edits.
  • Figure 5: Editing results on the DTU dataset. Blue boxes represent the initial mesh and shader reconstructed by neural deferred shading, providing a baseline. Orange boxes show the edited mesh results from TextDeformer, while yellow boxes represent the edited views from Instruct-NeRF2NeRF. Violet boxes represent renderings from GaussianEditor and their meshes extracted by SuGaR. LEMON achieves great results in both rendering and polygonal mesh quality.
  • ...and 8 more figures
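
The captions of Figures 2 and 4 describe two simple per-pixel operations: a weighted sum of the input and rendered images that seeds the diffusion model's initial noise, and an overlay of the masked generated regions onto the original images. The sketch below illustrates both on flat lists of floats. The convex-combination form `(1 - lam) * input + lam * rendered` is an assumption, chosen because it matches the statement that $\lambda = 0$ uses only the ground truth image. The function names, and whether the blend happens in pixel or latent space, are likewise assumptions, not details confirmed by the source.

```python
from typing import List, Sequence


def blend_for_initial_noise(
    input_img: Sequence[float], rendered_img: Sequence[float], lam: float
) -> List[float]:
    """Weighted sum of the input (ground truth) and rendered images.

    Assumed form: (1 - lam) * input + lam * rendered, so lam = 0 returns
    the input image unchanged and larger lam favors the current rendering.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is assumed to lie in [0, 1]")
    if len(input_img) != len(rendered_img):
        raise ValueError("images must have the same number of values")
    return [(1.0 - lam) * a + lam * b for a, b in zip(input_img, rendered_img)]


def composite_masked_edit(
    original: Sequence[float], generated: Sequence[float], mask: Sequence[float]
) -> List[float]:
    """Overlay generated pixels onto the original wherever the mask is set.

    `mask` holds values in [0, 1]; 1 keeps the generated pixel, 0 keeps the original.
    """
    if not (len(original) == len(generated) == len(mask)):
        raise ValueError("original, generated, and mask must have the same length")
    return [m * g + (1.0 - m) * o for o, g, m in zip(original, generated, mask)]


# Toy two-pixel "images" for illustration only.
seed = blend_for_initial_noise([1.0, 0.0], [0.0, 1.0], lam=0.25)
edited = composite_masked_edit([0.2, 0.4], [0.9, 0.8], [1.0, 0.0])
```

Under these assumptions, a small $\lambda$ keeps the seed close to the captured image, while the compositing step ensures that pixels outside the mask are carried over from the original image untouched.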