Refine3DNet: Scaling Precision in 3D Object Reconstruction from Multi-View RGB Images using Attention

Ajith Balakrishnan, Sreeja S, Linu Shine

TL;DR

Refine3DNet tackles 3D object reconstruction from multi-view RGB images by fusing CNN-based feature extraction with Transformer-style self-attention, followed by a 3D U-Net refiner. A novel Joint Train Separate Optimization (JTSO) procedure decouples encoder-decoder and attention/refiner learning to improve robustness with varying numbers of input views. On ShapeNet, the approach achieves state-of-the-art or near-state-of-the-art IoU scores in both single-view and multi-view settings, with notable gains in single-view reconstruction and strong performance across view counts up to twenty. The work demonstrates the practicality of a hybrid CNN-Transformer architecture with a dedicated optimization strategy for scalable, accurate 3D voxel reconstruction, albeit with higher resource requirements than some baselines.
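To make the role of self-attention over an unordered, variable-size set of views more concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention applied to per-view feature vectors. It is an illustration under stated assumptions, not the paper's layer: the feature width `d_model`, the random projection matrices `w_q`, `w_k`, `w_v`, and the function name `self_attention` are all hypothetical, and the actual model's head count, dimensions, and fusion step are not given in this summary.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(view_features, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    view_features: (n_views, d_model) array, one row per input view.
    Returns an array of the same shape in which each view's feature
    is a weighted mix of the value projections of all views.
    """
    q = view_features @ w_q
    k = view_features @ w_k
    v = view_features @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d_model = 8
w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

features = rng.standard_normal((3, d_model))  # three views
out = self_attention(features, w_q, w_k, w_v)

# Reordering the views only reorders the output rows.
perm = np.array([2, 0, 1])
out_perm = self_attention(features[perm], w_q, w_k, w_v)

# The same weights also accept a single view.
out_single = self_attention(features[:1], w_q, w_k, w_v)
```

Two properties of this sketch matter for multi-view reconstruction: the same weights accept any number of views, and permuting the input views permutes the outputs in the same way, so no view ordering has to be imposed.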

Abstract

Generating 3D models from multi-view 2D RGB images has gained significant attention, extending the capabilities of technologies like virtual reality, robotic vision, and human-machine interaction. In this paper, we introduce a hybrid strategy combining CNNs and transformers, featuring a visual auto-encoder with self-attention mechanisms and a 3D refiner network, trained using a novel Joint Train Separate Optimization (JTSO) algorithm. Encoded features from unordered inputs are transformed into an enhanced feature map by the self-attention layer, decoded into an initial 3D volume, and further refined. Our network generates 3D voxels from single or multiple 2D images taken from arbitrary viewpoints. Performance evaluations on the ShapeNet dataset show that our approach, combined with JTSO, outperforms state-of-the-art techniques in single- and multi-view 3D reconstruction, achieving the highest mean intersection over union (IoU) scores and surpassing other models by 4.2% in single-view reconstruction.
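Since the reported results are stated as mean IoU over voxel grids, a short sketch of that metric may help. The code below is an assumption-laden illustration rather than the authors' evaluation script: the binarization threshold of 0.5, the function names `voxel_iou` and `mean_iou`, and the convention of returning 1.0 when both grids are empty are choices made here, not taken from the paper.

```python
import numpy as np


def voxel_iou(pred_probs, gt_occupancy, threshold=0.5):
    """IoU between a predicted occupancy-probability grid and a
    binary ground-truth grid of the same shape."""
    pred = np.asarray(pred_probs) > threshold  # binarize the prediction
    gt = np.asarray(gt_occupancy).astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both grids are empty; treated as a perfect match (assumed convention).
        return 1.0
    return float(intersection) / float(union)


def mean_iou(pred_list, gt_list, threshold=0.5):
    """Average IoU over a collection of (prediction, ground truth) pairs."""
    scores = [voxel_iou(p, g, threshold) for p, g in zip(pred_list, gt_list)]
    return float(np.mean(scores))


# Tiny 2x2x2 example: one voxel overlaps, three voxels are in the union.
pred = np.zeros((2, 2, 2))
pred[0, 0, 0] = 0.9
pred[0, 0, 1] = 0.8
gt = np.zeros((2, 2, 2))
gt[0, 0, 0] = 1
gt[1, 1, 1] = 1

iou = voxel_iou(pred, gt)
avg = mean_iou([pred, gt], [gt, gt])
```

In the toy example the prediction and ground truth share one occupied voxel out of three occupied in either grid, so the IoU is 1/3; averaging that with a perfect match gives 2/3.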

Paper Structure

This paper contains 18 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Visualization of Refine3DNet for 3D Object Reconstruction from multiple images.
  • Figure 2: Detailed architecture of the proposed network
  • Figure 3: Illustration of multi-head attention and scaled dot-product attention
  • Figure 4: Detailed architecture of Refiner Network
  • Figure 5: Proposed JTSO Algorithm flow
  • ...and 2 more figures