Single-View 3D Reconstruction via SO(2)-Equivariant Gaussian Sculpting Networks

Ruihan Xu, Anthony Opipari, Joshua Mah, Stanley Lewis, Haoran Zhang, Hanzhe Guo, Odest Chadwicke Jenkins

TL;DR

The paper addresses the challenge of fast, reliable single-view 3D reconstruction with explicit geometry suitable for robotics. It introduces SO(2)-Equivariant Gaussian Sculpting Networks (GSNs) that deform a canonical Gaussian cube into a target object by predicting per-Gaussian deviations for position, scale, rotation, color, and opacity, all decoded from a ResNet-based encoder and trained with a multi-view rendering loss plus Extended Chamfer Distance to enforce equivariance. Key contributions include real-time performance (>150 FPS), competitive reconstruction quality against diffusion-based baselines, and a demonstration of object-centric grasp planning in a robotic pipeline. The work highlights the practical potential of explicit Gaussian splatting with equivariant priors for fast perceptual understanding in cluttered robotic applications, while noting limitations such as generalization beyond ShapeNet and future directions toward scene-level reconstructions and richer datasets.
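As a rough illustration of the equivariance idea, the sketch below computes the standard symmetric Chamfer distance between two sets of Gaussian centers. It then uses that distance to check whether a prediction from a rotated view matches the rotated prediction from the canonical view. The summary does not define the Extended Chamfer Distance, so only the plain variant is shown. The z-axis as the SO(2) rotation axis, the function names, and the random point set are all assumptions made for illustration, not details from the paper.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Standard symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    # Mean squared nearest-neighbour distance, summed over both directions.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())


def rotate_z(points: np.ndarray, angle_rad: float) -> np.ndarray:
    """Rotate points (N, 3) about the z-axis (assumed here to be the SO(2) axis)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ rot.T


rng = np.random.default_rng(0)
angle = np.deg2rad(30.0)

# Hypothetical Gaussian centers predicted from the canonical view.
canonical_means = rng.uniform(-1.0, 1.0, size=(256, 3))

# Stand-in for the prediction from a view rotated by `angle`: a perfectly
# equivariant predictor would return the canonical centers rotated by the same angle.
rotated_means = rotate_z(canonical_means, angle)

# Equivariance residual: undo the known rotation, then compare with the canonical output.
residual = chamfer_distance(rotate_z(rotated_means, -angle), canonical_means)
```

For a perfectly equivariant predictor the residual is zero up to floating-point error, so a term of this shape could serve as a training penalty or as an evaluation metric.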

Abstract

This paper introduces SO(2)-Equivariant Gaussian Sculpting Networks (GSNs) as an approach for SO(2)-equivariant 3D object reconstruction from single-view image observations. GSNs take a single observation as input and generate a Gaussian splat representation describing the observed object's geometry and texture. By using a shared feature extractor before decoding Gaussian colors, covariances, positions, and opacities, GSNs achieve extremely high throughput (>150 FPS). Experiments demonstrate that GSNs can be trained efficiently using a multi-view rendering loss and are competitive in quality with expensive diffusion-based reconstruction algorithms. The GSN model is validated on multiple benchmark experiments. Moreover, we demonstrate the potential for GSNs to be used within a robotic manipulation pipeline for object-centric grasping.
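The decoding step described above can be sketched roughly as follows: one shared latent vector feeds several parallel heads, and each head outputs per-Gaussian deviations that are applied to a canonical cube of Gaussians. Everything specific here is an assumption for illustration: the latent size, the 8x8x8 grid, the single linear layer per head, the activations, and the names `canonical_cube`, `make_head`, and `sculpt`. The covariance is composed as R S Sᵀ Rᵀ from a rotation and a scale, following the usual 3D Gaussian splatting parameterization. The summary does not confirm that GSNs use exactly this form.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 64    # assumed size of the shared feature vector
N_PER_AXIS = 8     # assumed cube resolution
N = N_PER_AXIS ** 3
HEAD_DIMS = {"position": 3, "scale": 3, "rotation": 4, "color": 3, "opacity": 1}


def canonical_cube(n_per_axis: int) -> np.ndarray:
    """Gaussian centers on a regular grid filling the cube [-1, 1]^3."""
    axis = np.linspace(-1.0, 1.0, n_per_axis)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    return grid.reshape(-1, 3)


def make_head(latent_dim: int, out_dim: int):
    """Hypothetical single linear layer standing in for one decoder MLP."""
    weight = rng.normal(scale=0.01, size=(latent_dim, out_dim))
    return lambda z: z @ weight


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def quat_to_rotmat(q: np.ndarray) -> np.ndarray:
    """Unit quaternions (N, 4) in (w, x, y, z) order to rotation matrices (N, 3, 3)."""
    w, x, y, z = q[:, 0], q[:, 1], q[:, 2], q[:, 3]
    rot = np.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y),
        2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y),
    ], axis=-1)
    return rot.reshape(-1, 3, 3)


# One head per Gaussian attribute, all reading the same latent vector.
heads = {name: make_head(LATENT_DIM, N * dim) for name, dim in HEAD_DIMS.items()}


def sculpt(latent: np.ndarray) -> dict:
    """Apply predicted per-Gaussian deviations to the canonical cube."""
    raw = {name: head(latent).reshape(N, -1) for name, head in heads.items()}
    means = canonical_cube(N_PER_AXIS) + raw["position"]
    scales = (1.0 / N_PER_AXIS) * np.exp(raw["scale"])       # keeps scales positive
    quats = np.array([1.0, 0.0, 0.0, 0.0]) + raw["rotation"]  # deviation from identity
    quats = quats / np.linalg.norm(quats, axis=1, keepdims=True)
    rot_scaled = quat_to_rotmat(quats) * scales[:, None, :]   # R @ diag(s)
    covariances = rot_scaled @ rot_scaled.transpose(0, 2, 1)  # R S S^T R^T
    return {
        "means": means,
        "covariances": covariances,
        "colors": sigmoid(raw["color"]),
        "opacities": sigmoid(raw["opacity"]),
    }


# Random vector standing in for the image encoder's output.
latent = rng.normal(size=LATENT_DIM)
gaussians = sculpt(latent)
```

Because every head reads the same latent vector, the image encoder runs once per input image, which is consistent with the high-throughput claim in the abstract.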

Paper Structure

This paper contains 17 sections, 5 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 2: SO(2)-Equivariant Gaussian Sculpting Network architecture. Our encoder-decoder network takes an input image and encodes it into a latent vector. A decoder with parallel MLPs then decodes the latent vector into Gaussian parameters, sculpting a canonical Gaussian splat into the 3D object shown in the input image. Finally, we perform multi-view rendering to obtain novel views for loss calculation.
  • Figure 3: Qualitative single-view reconstruction comparison. We reused the visualization results from Splatter-Image and added our results for comparison.
  • Figure 4: Grasps generated by Dex-Net 2.0 on novel views rendered from the 3D Gaussian splat produced by our model, with a canonical view and a rotated view as input. Dex-Net generates grasps for cars with grasp quality q-value=0.733 for the canonical view and q-value=0.614 for the rotated view.
  • Figure 5: Simulation of robot grasping using a Kuka manipulator in PyBullet. Video demo: https://youtu.be/USOLfgzOTf4