MESC-3D: Mining Effective Semantic Cues for 3D Reconstruction from a Single Image

Shaoming Li, Qing Cai, Songqi Kong, Runqing Tan, Heng Tong, Shiji Qiu, Yongguo Jiang, Zhi Liu

TL;DR

MESC-3D tackles single-view 3D reconstruction by addressing semantic entanglement and occlusion through two innovations: an Effective Semantic Mining Module (ESM) that lets each point selectively use meaningful semantic cues, and a 3D Semantic Prior Learning Module (3DSPL) that injects human-like 3D priors via learnable text prompts aligned with point clouds through contrastive learning. The architecture combines a ResNet18 image encoder, PointMAE geometry features, and a Multimodal Interlaced Transformer (MIT) to fuse modalities, followed by staged semantic refinement guided by the prior and a final MLP decoder for per-point coordinates. Training uses cross-modal contrastive loss plus Chamfer Distance for reconstruction, and the approach demonstrates strong improvements over state-of-the-art methods on ShapeNet and Pix3D, with notable zero-shot generalization to unseen categories. The work contributes a practical framework for robust, efficient single-image 3D reconstruction with explicit semantic cue mining and 3D prior incorporation, enabling better handling of occlusion and domain shifts in real-world scenes.
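The two training objectives named above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the squared-distance form of Chamfer Distance, the symmetric InfoNCE form of the contrastive loss, the `temperature` value, and all function names are assumptions, since the summary does not specify them.

```python
import numpy as np


def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance between point sets of shape (N, 3) and (M, 3)."""
    # Pairwise squared Euclidean distances, shape (N, M).
    diff = pred[:, None, :] - gt[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    # Mean nearest-neighbour distance in both directions (assumed squared form).
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()


def contrastive_loss(point_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over paired (point cloud, text) embeddings of shape (B, D)."""
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = p @ t.T / temperature  # (B, B); matching pairs lie on the diagonal

    def cross_entropy_diag(z):
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the point-to-text and text-to-point directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

In a training loop the two terms would typically be summed with a weighting factor; that weight is also not given in this summary.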

Abstract

Reconstructing 3D shapes from a single image plays an important role in computer vision. Many methods have been proposed and have achieved impressive performance. However, existing methods mainly focus on extracting semantic information from images and then simply concatenating it with 3D point clouds without further exploring the concatenated semantics. As a result, these entangled semantic features significantly hinder the reconstruction performance. In this paper, we propose a novel single-image 3D reconstruction method called Mining Effective Semantic Cues for 3D Reconstruction from a Single Image (MESC-3D), which can actively mine effective semantic cues from entangled features. Specifically, we design an Effective Semantic Mining Module to establish connections between point clouds and image semantic attributes, enabling the point clouds to autonomously select the necessary information. Furthermore, to address the potential insufficiencies in semantic information from a single image, such as occlusions, and inspired by the human ability to represent 3D objects using prior knowledge drawn from daily experiences, we introduce a 3D Semantic Prior Learning Module. This module incorporates semantic understanding of spatial structures, enabling the model to interpret and reconstruct 3D objects with greater accuracy and realism, closely mirroring human perception of complex 3D environments. Extensive evaluations show that our method achieves significant improvements in reconstruction quality and robustness compared to prior works. Additionally, further experiments validate the method's strong generalization capability and its zero-shot performance on unseen classes. Code is available at https://github.com/QINGQINGLE/MESC-3D.
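One common way to let each point "select" image semantics is cross-attention, with points as queries and image features as keys and values. The sketch below shows that general mechanism only. Whether the Effective Semantic Mining Module uses this exact form is an assumption, and the feature sizes, projection matrices, and the name `point_to_semantic_attention` are hypothetical.

```python
import numpy as np


def point_to_semantic_attention(point_feats, sem_tokens, w_q, w_k, w_v):
    """Each point (query) attends over image semantic tokens (keys/values).

    point_feats: (N, Dp), sem_tokens: (T, Ds); w_q: (Dp, D); w_k, w_v: (Ds, D).
    Returns per-point semantic features (N, D) and attention weights (N, T).
    """
    q = point_feats @ w_q
    k = sem_tokens @ w_k
    v = sem_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over the T tokens
    return weights @ v, weights


rng = np.random.default_rng(0)
points = rng.normal(size=(1024, 64))  # hypothetical per-point features
tokens = rng.normal(size=(49, 128))   # hypothetical 7x7 grid of image features
w_q = rng.normal(size=(64, 32)) * 0.1
w_k = rng.normal(size=(128, 32)) * 0.1
w_v = rng.normal(size=(128, 32)) * 0.1
selected, attn = point_to_semantic_attention(points, tokens, w_q, w_k, w_v)
```

Each row of `attn` is a distribution over image tokens, so a point can weight the cues relevant to it instead of receiving one concatenated global feature.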

Paper Structure

This paper contains 17 sections, 9 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: The overall architecture of MESC-3D. Our network is composed of two main components. (a) The 3DSPL aligns point cloud modality features with text features, aiming to capture the unique 3D geometric characteristics of each category. (b) The ESM establishes a connection between the semantic feature $F_{i}$ and the 3D point cloud at the $i^{th}$ stage, allowing each point to autonomously select the most valuable semantic information.
  • Figure 2: Visual comparison of 2D-to-3D reconstruction results from different methods on the ShapeNet dataset. Additional qualitative results are provided in the supplementary material.
  • Figure 3: Visual comparison of 2D-to-3D reconstruction results from different methods on the Pix3D dataset.
  • Figure 4: Ablation study on the learnable text prompt. Visual results on ShapeNet.
  • Figure 5: Generalization on base classes with various methods.
  • ...and 7 more figures