DPA-Net: Structured 3D Abstraction from Sparse Views via Differentiable Primitive Assembly

Fenggen Yu, Yiming Qian, Xu Zhang, Francisca Gil-Ureta, Brian Jackson, Eric Bennett, Hao Zhang

TL;DR

DPA-Net learns structured 3D abstractions, in the form of primitive assemblies, from a handful of RGB views without 3D supervision. It integrates differentiable primitive assembly (DPA) into a NeRF-like framework to predict a 3D occupancy field from a union of convexes, each an intersection of convex quadric primitives, which enables volume-rendering-based losses in image space. The method adds test-time adaptation together with improvements such as silhouette-aware sampling and primitive dropout to improve fidelity and compactness, outperforming state-of-the-art sparse-view approaches on ShapeNet and DTU while yielding interpretable, editable 3D structures. These structured abstractions can serve as CAD-friendly inputs or as structural prompts for downstream 3D generation tasks, which makes them useful in editing, assembly, and design workflows.

Abstract

We present a differentiable rendering framework to learn structured 3D abstractions in the form of primitive assemblies from sparse RGB images capturing a 3D object. By leveraging differentiable volume rendering, our method does not require 3D supervision. Architecturally, our network follows the general pipeline of an image-conditioned neural radiance field (NeRF) exemplified by pixelNeRF for color prediction. As our core contribution, we introduce differentiable primitive assembly (DPA) into NeRF to output a 3D occupancy field in place of density prediction, where the predicted occupancies serve as opacity values for volume rendering. Our network, coined DPA-Net, produces a union of convexes, each as an intersection of convex quadric primitives, to approximate the target 3D object, subject to an abstraction loss and a masking loss, both defined in the image space upon volume rendering. With test-time adaptation and additional sampling and loss designs aimed at improving the accuracy and compactness of the obtained assemblies, our method demonstrates superior performance over state-of-the-art alternatives for 3D primitive abstraction from sparse views.
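
To make the phrase "a union of convexes, each as an intersection of convex quadric primitives" concrete, the following is a minimal sketch of how a point occupancy could be computed from such an assembly. It is not the authors' implementation. The axis-aligned quadric parameterization, the inside-when-negative sign convention, the max/min realization of intersection and union, the sigmoid with its `sharpness` parameter, and all function names are assumptions made for illustration.

```python
import math

def quadric(params, p):
    """Evaluate a*x^2 + b*y^2 + c*z^2 + d*x + e*y + f*z + g at point p.

    Assumed convention: the point is inside the primitive when the value
    is <= 0, and a, b, c >= 0 so that the inside region is convex.
    """
    a, b, c, d, e, f, g = params
    x, y, z = p
    return a * x * x + b * y * y + c * z * z + d * x + e * y + f * z + g

def convex_value(primitives, p):
    # Intersection: inside only if inside every primitive, so take the max.
    return max(quadric(q, p) for q in primitives)

def assembly_value(convexes, p):
    # Union: inside if inside at least one convex, so take the min.
    return min(convex_value(c, p) for c in convexes)

def occupancy(convexes, p, sharpness=50.0):
    """Map the signed assembly value to a soft occupancy in (0, 1)."""
    t = sharpness * assembly_value(convexes, p)
    t = max(-60.0, min(60.0, t))  # clamp to keep math.exp from overflowing
    return 1.0 / (1.0 + math.exp(t))

# Hypothetical assembly made of two convexes.
unit_sphere = (1, 1, 1, 0, 0, 0, -1)        # x^2 + y^2 + z^2 - 1 <= 0
lower_half = (0, 0, 0, 0, 0, 1, 0)          # z <= 0
small_sphere = (1, 1, 1, -4, 0, 0, 3.75)    # radius 0.5, centred at (2, 0, 0)
shape = [[unit_sphere, lower_half], [small_sphere]]

inside_hemisphere = occupancy(shape, (0.0, 0.0, -0.5))
above_cut_plane = occupancy(shape, (0.0, 0.0, 0.5))
inside_small_sphere = occupancy(shape, (2.0, 0.0, 0.0))
```

With this assembly, the first convex is the lower half of the unit sphere, so a point below the cutting plane is occupied while the mirrored point above it is not; the second convex contributes occupancy around (2, 0, 0). Because max, min, and the sigmoid are differentiable almost everywhere, an occupancy of this kind can be optimized through image-space losses.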

Paper Structure (15 sections, 3 equations, 8 figures, 5 tables)


Figures (8)

  • Figure 1: Our method takes as few as three RGB images from disparate views and abstracts a textured 3D shape formed by a union of convex parts that well reflect the shape semantics. With a differentiable primitive assembly subject to only image-space losses, our network is trained without 3D supervision. With the meaningful parts abstracted, the resulting shape can be edited.
  • Figure 2: Overview of DPA-Net. Given a sparse set of input RGB images whose viewpoints can differ significantly, our network is trained to predict a 3D primitive assembly, i.e., a 3D abstraction, via differentiable volume rendering without 3D supervision. The high-level network architecture resembles that of an image-conditioned NeRF such as pixelNeRF (Yu et al., 2021) for color prediction from multi-scale image features. What is new is that the density estimation in NeRF is replaced by our novel differentiable primitive assembly (DPA). DPA takes as input multi-view image features from ResNet that are fused into a shape feature via weighted pooling. The shape feature is then passed into an MLP (the primitive decoder) to predict the parameters of a set of convex quadric primitives. 3D query points and the primitives are assembled by two CSG-based assembly layers (intersection and then union) to predict point occupancies, which serve as opacity values both for volume rendering and for predicting an image mask. An RGB loss and a masking loss are calculated against the input images and object masks. Note that we assume that the camera poses and object masks are either provided or estimated prior to our 3D abstraction.
  • Figure 3: Visual comparisons on the ShapeNet chair benchmark. The red ovals highlight noisy surfaces, which are easier to see when the image is zoomed in.
  • Figure 4: Ablation studies on the three key strategies to boost DPA-Net. The #P labels in the last column show the number of quadric primitives used. The cyan ovals highlight challenging reconstruction areas that our method with TTA or adaptive sampling correctly recovers. The red ovals highlight redundant overlapping convex parts.
  • Figure 5: Qualitative comparisons in the category-agnostic setting. The image resolution is $64\times 64$.
  • ...and 3 more figures
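
The Figure 2 caption states that the predicted point occupancies serve as opacity values both for volume rendering and for predicting an image mask. The sketch below shows one common way to do this along a single ray, using standard front-to-back alpha compositing. It is a sketch under stated assumptions, not the paper's code: the function name `composite`, the per-sample inputs, and the choice to read the mask as the accumulated opacity are illustrative.

```python
def composite(alphas, colors):
    """Front-to-back alpha compositing along one ray.

    alphas: per-sample opacities in [0, 1], ordered from near to far
            (assumed here to be the predicted occupancies).
    colors: per-sample RGB triples in the same order.
    Returns (pixel_rgb, mask_value).
    """
    transmittance = 1.0           # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    mask = 0.0
    for alpha, color in zip(alphas, colors):
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        mask += weight            # accumulated opacity gives the mask value
        transmittance *= 1.0 - alpha
    return pixel, mask

red, green, blue = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)

# Empty space first, then a fully occupied sample hides everything behind it.
opaque_pixel, opaque_mask = composite([0.0, 1.0, 1.0], [red, green, blue])

# Two half-occupied samples blend, and the ray is only partly covered.
soft_pixel, soft_mask = composite([0.5, 0.5], [red, blue])

# A ray that misses the object yields a zero mask.
empty_pixel, empty_mask = composite([0.0, 0.0], [red, blue])
```

The rendered pixel can then be compared with the input image for an RGB loss, and the accumulated opacity with the object mask for a masking loss, which is how image-space supervision can replace 3D supervision.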