Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis and Manipulation

Yiftach Edelstein, Or Patashnik, Dana Cohen-Bar, Lihi Zelnik-Manor

TL;DR

Sharp-It tackles the quality-controllability gap between native 3D generative approaches and multi-view reconstruction by refining a low-quality, 3D-consistent object across multiple views with a diffusion model that shares information across views. It leverages a Shap-E backbone for 3D structure and an 8-channel conditioning scheme, trained on a large paired dataset of degraded/high-quality multi-view renders, with cross-view attention guiding 3D-consistent refinements. The approach achieves superior FID and semantic alignment (CLIP/DINO) compared to baselines and enables text-to-3D synthesis, editing, appearance edits, and controlled generation with competitive or faster runtimes. By uniting 3D-aware generation with diffusion-based detail synthesis, Sharp-It offers a practical, efficient pathway to high-quality, editable 3D assets suitable for fast content creation and manipulation.
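The 8-channel conditioning mentioned above can be illustrated with a small sketch. The summary does not spell out how the eight channels are composed, so this sketch assumes a common latent-diffusion pattern: a 4-channel noisy latent concatenated channel-wise with a 4-channel latent of the degraded multi-view render. The function name `build_unet_input`, the 4+4 split, and the tensor shapes are illustrative assumptions, not details confirmed by the source.

```python
import numpy as np

def build_unet_input(noisy_latent, cond_latent):
    """Concatenate two latents along the channel axis.

    ASSUMPTION: both latents have shape (batch, 4, H, W), so the result
    has 8 channels. The source only states that conditioning uses 8 channels.
    """
    if noisy_latent.shape != cond_latent.shape:
        raise ValueError("noisy and conditioning latents must share a shape")
    return np.concatenate([noisy_latent, cond_latent], axis=1)

rng = np.random.default_rng(0)
noisy = rng.standard_normal((1, 4, 32, 32))  # latent being denoised (hypothetical size)
cond = rng.standard_normal((1, 4, 32, 32))   # latent of the degraded render (hypothetical size)
unet_input = build_unet_input(noisy, cond)
print(unet_input.shape)
```

Under this assumption, only the first layer of the denoiser needs to accept the wider input; the conditioning latent is passed unchanged at every denoising step.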

Abstract

Advancements in text-to-image diffusion models have led to significant progress in fast 3D content creation. One common approach is to generate a set of multi-view images of an object, and then reconstruct it into a 3D model. However, this approach bypasses the use of a native 3D representation of the object and is hence prone to geometric artifacts and limited in controllability and manipulation capabilities. An alternative approach involves native 3D generative models that directly produce 3D representations. These models, however, are typically limited in their resolution, resulting in lower-quality 3D objects. In this work, we bridge the quality gap between methods that directly generate 3D representations and ones that reconstruct 3D objects from multi-view images. We introduce a multi-view to multi-view diffusion model called Sharp-It, which takes a 3D-consistent set of multi-view images rendered from a low-quality object and enriches its geometric details and texture. The diffusion model operates on the multi-view set in parallel, in the sense that it shares features across the generated views. A high-quality 3D model can then be reconstructed from the enriched multi-view set. By leveraging the advantages of both 2D and 3D approaches, our approach offers an efficient and controllable method for high-quality 3D content creation. We demonstrate that Sharp-It enables various 3D applications, such as fast synthesis, editing, and controlled generation, while attaining high-quality assets.

Paper Structure

This paper contains 28 sections, 1 equation, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Sharp-It is a multi-view to multi-view model that enhances low-quality 3D shapes. It corrects fine-grained geometry details and adds appearance features. The top row displays three degraded shapes and their enhancements by Sharp-It. The bottom row demonstrates Sharp-It's capability to edit the appearance of 3D shapes.
  • Figure 2: Overview of 3D generation pipeline with Sharp-It. First, a 3D object is generated with Shap-E. Then, we render six views of this low-quality object. Sharp-It is a diffusion model based on Stable Diffusion (Rombach et al., 2022) that enhances these views with the guidance of a text prompt by refining geometry and adding detailed appearance. Sharp-It employs cross-attention layers for text-based guidance and self-attention layers for cross-view consistency. A high-quality 3D object can be reconstructed from the multi-view image set.
  • Figure 3: Self-attention maps for a query point (red) on the car's wheel, showing highest attention weights at corresponding wheel locations across different views.
  • Figure 4: Comparison of Sharp-It with other methods for 3D object enhancement (GaussianDreamer and MVEdit), and multi-view enhancement (SDEdit based). The first column shows the input object generated by Shap-E. As can be seen, our method achieves the highest quality results while best preserving the input object.
  • Figure 5: Qualitative ablation study. The first column shows the degraded input object generated by Shap-E. Subsequent columns show the effects of removing specific components: omitting the text prompt leads to reduced texture detail, while excluding diverse lighting results in a flatter appearance with less realistic shading. The full model achieves the most refined and detailed result.
  • ...and 4 more figures
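
The captions for Figures 2 and 3 describe self-attention that spans all views and attention maps for a single query point. A minimal NumPy sketch of that idea follows. It assumes the tokens of all views are merged into one sequence before a single-head attention step; the function names, the random projection matrices, and the toy sizes (six views, 16 tokens per view) are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

def cross_view_self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over the tokens of all views jointly.

    tokens: (V, N, C) array holding N token features for each of V views.
    Returns (out, weights) with out of shape (V, N, C_out) and
    weights of shape (V*N, V*N).
    """
    V, N, C = tokens.shape
    x = tokens.reshape(V * N, C)                  # merge views into one sequence
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = (weights @ v).reshape(V, N, -1)
    return out, weights

def attention_maps_for_query(weights, view, token, V, N):
    """Attention of one query token over every view, returned as (V, N)."""
    return weights[view * N + token].reshape(V, N)

rng = np.random.default_rng(0)
V, N, C = 6, 16, 8                                # toy sizes (assumed)
tokens = rng.standard_normal((V, N, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) for _ in range(3))

out, weights = cross_view_self_attention(tokens, w_q, w_k, w_v)
maps = attention_maps_for_query(weights, view=0, token=5, V=V, N=N)
print(out.shape, maps.shape)
```

Because every query attends to keys from all views, a token on the wheel in one view can draw features from wheel tokens in the other views, which is the behavior the Figure 3 caption reports. In a real model each row of `maps` would be reshaped to the spatial grid of its view for display.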