Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision

Michael Niemeyer, Lars Mescheder, Michael Oechsle, Andreas Geiger

TL;DR

The paper tackles 3D reconstruction without 3D supervision by introducing Differentiable Volumetric Rendering (DVR), which learns implicit shape and texture fields from 2D observations. Its key insight is that the gradient of the predicted surface depth can be derived analytically via implicit differentiation. DVR renders 2D images from the implicit representations and is trained with RGB, depth, and occupancy losses; the backward pass is memory-efficient because no volumetric data has to be stored. The approach supports both single-view and multi-view training. Its single-view reconstructions rival those of fully supervised methods, multi-view reconstruction directly yields watertight meshes, and it shows strong performance on real-world data such as the DTU dataset. By enabling 2D-supervised learning, the work broadens the applicability of implicit representations and avoids discretized volume grids and template meshes.

Abstract

Learning-based 3D reconstruction methods have shown impressive results. However, most methods require 3D supervision, which is often hard to obtain for real-world datasets. Recently, several works have proposed differentiable rendering techniques to train reconstruction models from RGB images. Unfortunately, these approaches are currently restricted to voxel- and mesh-based representations, suffering from discretization or low resolution. In this work, we propose a differentiable rendering formulation for implicit shape and texture representations. Implicit representations have recently gained popularity as they represent shape and texture continuously. Our key insight is that depth gradients can be derived analytically using the concept of implicit differentiation. This allows us to learn implicit shape and texture representations directly from RGB images. We experimentally show that our single-view reconstructions rival those learned with full 3D supervision. Moreover, we find that our method can be used for multi-view 3D reconstruction, directly resulting in watertight meshes.
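
The implicit-differentiation step mentioned in the abstract can be sketched as follows. This is a reconstruction under stated assumptions, not a quotation of the paper's equations: $f_\theta$ is the occupancy network, $\tau$ the isosurface threshold, $\mathbf{w}$ the ray direction through a pixel, and $\mathbf{r}_0$ (a symbol assumed here) the camera origin, so that the surface point is $\mathbf{\hat{p}} = \mathbf{r}_0 + \hat{d}\,\mathbf{w}$. The derivation also assumes the ray is not tangent to the surface, i.e. $\frac{\partial f_\theta(\mathbf{\hat{p}})}{\partial \mathbf{\hat{p}}} \cdot \mathbf{w} \neq 0$.

```latex
% The surface point lies on the isosurface for every value of \theta:
f_\theta(\mathbf{\hat{p}}) = \tau,
\qquad \mathbf{\hat{p}} = \mathbf{r}_0 + \hat{d}\,\mathbf{w}.

% Differentiate both sides with respect to \theta (total derivative):
\frac{\partial f_\theta(\mathbf{\hat{p}})}{\partial \theta}
+ \frac{\partial f_\theta(\mathbf{\hat{p}})}{\partial \mathbf{\hat{p}}}
  \cdot \frac{\partial \mathbf{\hat{p}}}{\partial \theta} = 0,
\qquad
\frac{\partial \mathbf{\hat{p}}}{\partial \theta}
= \mathbf{w}\,\frac{\partial \hat{d}}{\partial \theta}.

% Solve for the depth gradient:
\frac{\partial \hat{d}}{\partial \theta}
= -\left(
    \frac{\partial f_\theta(\mathbf{\hat{p}})}{\partial \mathbf{\hat{p}}}
    \cdot \mathbf{w}
  \right)^{-1}
  \frac{\partial f_\theta(\mathbf{\hat{p}})}{\partial \theta}.
```

The final expression only involves quantities evaluated at the surface point $\mathbf{\hat{p}}$, which is consistent with the summary's remark that the backward pass does not need to store volumetric data.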

Paper Structure

This paper contains 14 sections, 16 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Overview. We show that volumetric rendering is inherently differentiable for implicit shape and texture representations. Using an analytic expression for the gradient of the depth $\frac{\partial \hat{d}}{\partial \theta}$ with respect to the network parameters $\theta$, we learn implicit 3D representations $f_\theta$ from 2D images.
  • Figure 2: Differentiable Volumetric Rendering. We first predict the surface depth $\hat{d}$ by performing occupancy evaluations for a given camera matrix. To this end, we project a sampled pixel $\mathbf{u}$ to 3D and evaluate the occupancy network at fixed steps on the ray cast from the camera origin towards this point. We then unproject the surface depth into 3D and evaluate the texture field at the given 3D location. The resulting 2D rendering $\mathbf{\hat{I}}$ can be compared to the ground truth image. When we also have access to ground truth depth maps, we can define a loss directly on the predicted surface depth. We can make our model conditional by incorporating an additional image encoder that predicts a global descriptor $\mathbf{z}$ of both shape and texture.
  • Figure 3: Notation. To render an object from the occupancy network $f_\theta$ and texture field $\mathbf{t}_\theta$, we cast a ray with direction $\mathbf{w}$ through a pixel $\mathbf{u}$ and determine the intersection point $\mathbf{\hat{p}}$ with the isosurface $f_\theta(\mathbf{p}) = \tau$. Afterwards, we evaluate the texture field $\mathbf{t}_\theta$ at $\mathbf{\hat{p}}$ to obtain the color prediction $\mathbf{\hat{I}}_\mathbf{u}$ at $\mathbf{u}$.
  • Figure 4: Single-View Reconstruction. We show the input renderings from [Choy2016ECCV] and the outputs of our 2D-supervised ($\mathcal{L}_\text{RGB}$) and 2.5D-supervised ($\mathcal{L}_\text{Depth}$) models, Soft Rasterizer [LIU2019ICCV], and Pixel2Mesh [Wang2018ECCVc]. For 2D-supervised methods, we use a corresponding view from [KATO2018CVPR] as input.
  • Figure 5: Single-View Reconstruction with Single-View Supervision. While only trained with a single view per object, our model predicts accurate 3D geometry and texture.
  • ...and 3 more figures
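
The depth prediction described in the captions of Figures 2 and 3 (evaluate the occupancy along a ray at fixed steps, then locate the crossing of the isosurface $f_\theta(\mathbf{p}) = \tau$) can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the "network" is a hand-written soft sphere whose only parameter is its radius, the function names `occupancy`, `predict_depth`, and `implicit_depth_gradient` are hypothetical, the step counts are arbitrary, and the bracketed secant (regula falsi) refinement is one possible way to sharpen the coarse estimate rather than a confirmed detail of the authors' implementation. The last function evaluates the depth gradient obtained by implicitly differentiating $f_\theta(\mathbf{\hat{p}}) = \tau$, using central differences in place of automatic differentiation.

```python
import numpy as np

def occupancy(p, radius=1.0, sharpness=10.0):
    """Toy stand-in for f_theta: a soft sphere, near 1 inside and near 0 outside."""
    dist = np.linalg.norm(p, axis=-1)
    return 1.0 / (1.0 + np.exp(-sharpness * (radius - dist)))

def predict_depth(f, r0, w, tau=0.5, d_max=6.0, n_steps=64, n_refine=8):
    """Depth of the first free-to-occupied crossing along r0 + d * w, or None."""
    depths = np.linspace(0.0, d_max, n_steps)
    values = f(r0 + depths[:, None] * w) - tau
    # First interval where the ray passes from free space (< 0) to occupied (>= 0).
    crossings = np.nonzero((values[:-1] < 0) & (values[1:] >= 0))[0]
    if crossings.size == 0:
        return None  # the ray never hits the object
    j = crossings[0]
    d_lo, d_hi = depths[j], depths[j + 1]
    f_lo, f_hi = values[j], values[j + 1]
    d_mid = d_lo
    # Bracketed secant (regula falsi) refinement inside the coarse interval.
    for _ in range(n_refine):
        d_mid = d_lo - f_lo * (d_hi - d_lo) / (f_hi - f_lo)
        f_mid = f(r0 + d_mid * w) - tau
        if f_mid < 0:
            d_lo, f_lo = d_mid, f_mid
        else:
            d_hi, f_hi = d_mid, f_mid
    return d_mid

def implicit_depth_gradient(r0, w, d_hat, radius, eps=1e-5):
    """d(d_hat)/d(radius) = -(grad_p f . w)^(-1) * df/d(radius) at the surface point."""
    p_hat = r0 + d_hat * w
    # Partial derivative of f with respect to the single parameter (the radius).
    df_dtheta = (occupancy(p_hat, radius + eps)
                 - occupancy(p_hat, radius - eps)) / (2 * eps)
    # Directional derivative of f along the ray, i.e. grad_p f . w.
    df_dray = (occupancy(p_hat + eps * w, radius)
               - occupancy(p_hat - eps * w, radius)) / (2 * eps)
    return -df_dtheta / df_dray

r0 = np.array([0.0, 0.0, -2.5])   # assumed camera origin
w = np.array([0.0, 0.0, 1.0])     # assumed unit ray direction through a pixel
d_hat = predict_depth(occupancy, r0, w)
grad = implicit_depth_gradient(r0, w, d_hat, radius=1.0)
print(d_hat, grad)
```

For this toy setup the printed depth should be close to 1.5 and the gradient close to -1: the unit sphere's surface lies 1.5 units along the ray, and enlarging the radius moves the surface toward the camera at the same rate. In a learned model, the texture field would then be evaluated at `r0 + d_hat * w` to obtain the pixel color.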