Anything-3D: Towards Single-view Anything Reconstruction in the Wild

Qiuhong Shen, Xingyi Yang, Xinchao Wang

TL;DR

This work tackles the problem of reconstructing arbitrary 3D objects from a single real-world image. It introduces Anything-3D, a pipeline that combines object segmentation (Segment-Anything), image captioning (BLIP) with textual inversion, and NeRF optimization under a diffusion prior to recover geometry and texture from a single view. Coarse geometry is first obtained from Point-E and then refined with diffusion-guided score distillation to produce a high-fidelity 3D representation, even for unseen object categories. The approach shows strong qualitative robustness to occlusion, lighting, and viewpoint changes, which suggests meaningful progress toward in-the-wild single-view 3D reconstruction and a practical path for single-image 3D content creation.

Abstract

3D reconstruction from a single RGB image in unconstrained real-world scenarios presents numerous challenges due to the inherent diversity and complexity of objects and environments. In this paper, we introduce Anything-3D, a methodical framework that ingeniously combines a series of visual-language models and the Segment-Anything object segmentation model to elevate objects to 3D, yielding a reliable and versatile system for the single-view conditioned 3D reconstruction task. Our approach employs a BLIP model to generate textual descriptions, utilizes the Segment-Anything model for the effective extraction of objects of interest, and leverages a text-to-image diffusion model to lift the object into a neural radiance field. Demonstrating its ability to produce accurate and detailed 3D reconstructions for a wide array of objects, Anything-3D shows promise in addressing the limitations of existing methodologies. Through comprehensive experiments and evaluations on various datasets, we showcase the merits of our approach, underscoring its potential to contribute meaningfully to the field of 3D reconstruction. Demos and code will be available at https://github.com/Anything-of-anything/Anything-3D.
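
The data flow described in the abstract can be sketched as three pluggable stages. This is a minimal illustration only, not the authors' implementation: the names `reconstruct`, `apply_mask`, `Reconstruction`, and the `toy_*` stand-ins are hypothetical, and captioning the segmented object (rather than the full image) is an assumption. In a real system the three callables would wrap Segment-Anything, BLIP, and the diffusion-guided radiance-field optimization.

```python
from dataclasses import dataclass
from typing import Any, List

Image = List[List[float]]  # toy grayscale image: rows of pixel intensities
Mask = List[List[bool]]    # per-pixel foreground flags


@dataclass
class Reconstruction:
    caption: str          # text that conditions the diffusion prior
    object_pixels: Image  # input image with the background zeroed out
    field: Any            # output of the lifting stage (a NeRF in the paper)


def apply_mask(image: Image, mask: Mask) -> Image:
    """Keep foreground pixels and zero out everything else."""
    return [[px if keep else 0.0 for px, keep in zip(row, mask_row)]
            for row, mask_row in zip(image, mask)]


def reconstruct(image, segment, describe, lift) -> Reconstruction:
    """Run the three stages in order (stage order is an assumption)."""
    mask = segment(image)          # role played by Segment-Anything
    obj = apply_mask(image, mask)  # isolate the object of interest
    caption = describe(obj)        # role played by BLIP
    field = lift(obj, caption)     # role played by diffusion-guided optimization
    return Reconstruction(caption, obj, field)


# Hypothetical stand-ins so the sketch runs without any model weights.
def toy_segment(image):
    return [[px > 0.5 for px in row] for row in image]


def toy_describe(obj):
    return "a bright object"


def toy_lift(obj, caption):
    n = sum(1 for row in obj for px in row if px > 0.0)
    return {"prompt": caption, "foreground_pixels": n}


result = reconstruct([[0.1, 0.9], [0.8, 0.2]],
                     toy_segment, toy_describe, toy_lift)
```

Passing the stages in as callables keeps the sketch independent of any particular model API; only the order of the hand-offs (mask, then caption, then lifting) is what the example is meant to show.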

Paper Structure

This paper contains 13 sections, 6 equations, 4 figures.

Figures (4)

  • Figure 1: The Anything-3D framework proficiently recovers the 3D geometry and texture of any object from a single-view image captured in uncontrolled environments. Despite significant variations in camera perspective and object properties, our approach consistently delivers reliable recovery results.
  • Figure 2: Anything-3D combines visual-language models and object segmentation for efficient single-view 3D reconstruction. The framework employs BLIP for textual description generation and SAM for object segmentation, followed by 3D reconstruction using a pre-trained 2D text-to-image diffusion model. This model processes the 2D image and textual description, utilizing score distillation to train a neural radiance field specific to the image for image-to-3D synthesis.
  • Figure 3: Generated 3D objects of a convertible car, crane and rubber duck, visualized from five viewpoints.
  • Figure 4: Generated 3D objects of a cannon, piggy bank and stool, visualized from five viewpoints.
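
The score distillation step named in the Figure 2 caption can be illustrated with a one-dimensional toy. The sketch below is not the paper's code. It assumes a scalar "scene" parameter `theta`, a linear stand-in renderer, and a Gaussian prior N(`MU`, `S`^2) over rendered values whose exact noise prediction replaces the pre-trained diffusion model. The weighting `1 - alpha_bar` and all names are illustrative choices. As is standard for score distillation, the gradient is the weighted noise residual times the renderer Jacobian, without differentiating through the noise predictor.

```python
import math
import random

MU, S = 3.0, 0.5  # assumed toy prior over rendered values: x ~ N(MU, S^2)


def render(theta):
    """Toy differentiable renderer mapping the scene parameter to a 1-D 'image'."""
    return 2.0 * theta


def render_jacobian(theta):
    """dx/dtheta of the toy renderer."""
    return 2.0


def predict_noise(x_t, alpha_bar):
    """Exact noise prediction under the Gaussian prior (stands in for the diffusion model)."""
    var_t = alpha_bar * S ** 2 + (1.0 - alpha_bar)       # variance of the noised value
    score = -(x_t - math.sqrt(alpha_bar) * MU) / var_t   # d/dx log p_t(x_t)
    return -math.sqrt(1.0 - alpha_bar) * score


def sds_step(theta, rng, lr=0.05):
    """One score-distillation update on theta."""
    alpha_bar = rng.uniform(0.02, 0.98)  # random noise level
    eps = rng.gauss(0.0, 1.0)            # noise added to the rendering
    x_t = math.sqrt(alpha_bar) * render(theta) + math.sqrt(1.0 - alpha_bar) * eps
    weight = 1.0 - alpha_bar             # assumed weighting w(t)
    residual = predict_noise(x_t, alpha_bar) - eps
    grad = weight * residual * render_jacobian(theta)
    return theta - lr * grad


def optimize(theta0=-1.0, steps=2000, seed=0):
    """Repeat SDS updates from theta0 with a seeded random generator."""
    rng = random.Random(seed)
    theta = theta0
    for _ in range(steps):
        theta = sds_step(theta, rng)
    return theta


theta_final = optimize()
```

Under these toy assumptions, `render(theta_final)` should settle near `MU`, the mode of the assumed prior. This mirrors, in miniature, how a diffusion prior pulls renderings of the radiance field toward images it considers likely for the given text.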