HYVE: Hybrid Vertex Encoder for Neural Distance Fields

Stefan Rhys Jeske, Jonathan Klein, Dominik L. Michels, Jan Bender

TL;DR

HYVE targets efficient, high-fidelity 3D shape encoding by learning a single-pass encoder–decoder for neural distance fields. It combines a multi-scale hybrid graph- and grid-based architecture with a novel point-to-grid feature transfer and a lightweight, SIREN-inspired decoder, trained via the eikonal equation using only surface samples. The method handles non-manifold and non-watertight geometry through a simple loss modification, and it reconstructs fine surface detail with fast inference over large point clouds. Its use of latent grids and a physically inspired projection lays the groundwork for scalable, editable neural distance fields in practical 3D pipelines.

Abstract

Neural shape representation generally refers to representing 3D geometry using neural networks, e.g., computing a signed distance or occupancy value at a specific spatial position. In this paper we present a neural-network architecture suitable for accurate encoding of 3D shapes in a single forward pass. Our architecture is based on a multi-scale hybrid system incorporating graph-based and voxel-based components, as well as a continuously differentiable decoder. The hybrid system includes a novel way of voxelizing point-based features in neural networks, which we show can be used in combination with oriented point clouds to obtain smoother and more detailed reconstructions. Furthermore, our network is trained to solve the eikonal equation and only requires knowledge of the zero-level set for training and inference. This means that, in contrast to most previous shape encoder architectures, our network is able to output valid signed distance fields without explicit prior knowledge of non-zero distance values or shape occupancy. It also requires only a single forward pass, instead of the latent-code optimization used in auto-decoder methods. We further propose a modification to the loss function for cases where surface normals are not well defined, e.g., for non-watertight surfaces and non-manifold geometry, resulting in an unsigned distance field. Overall, our system can help to reduce the computational overhead of training and evaluating neural distance fields, and it enables their application to difficult geometry.
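To make the eikonal constraint mentioned in the abstract concrete, the following is a minimal sketch and not the paper's loss. It assumes an analytic sphere in place of a trained network and uses finite differences in place of automatic differentiation; the helper names (`sphere_sdf`, `gradient`, `eikonal_residual`, `surface_residual`) are hypothetical. It checks the two conditions a valid distance field should satisfy: the gradient has unit norm everywhere, and the field vanishes on the surface samples.

```python
import math

def sphere_sdf(p, radius=1.0):
    """Analytic signed distance to a sphere centred at the origin (stand-in for a network)."""
    return math.sqrt(sum(c * c for c in p)) - radius

def gradient(f, p, h=1e-5):
    """Central finite-difference gradient of a scalar field f at point p."""
    grad = []
    for i in range(len(p)):
        hi, lo = list(p), list(p)
        hi[i] += h
        lo[i] -= h
        grad.append((f(hi) - f(lo)) / (2.0 * h))
    return grad

def eikonal_residual(f, p):
    """| ||grad f(p)|| - 1 |, which is zero for an exact distance field."""
    g = gradient(f, p)
    return abs(math.sqrt(sum(c * c for c in g)) - 1.0)

def surface_residual(f, surface_points):
    """Mean |f(x)| over surface samples; zero when they lie on the zero-level set."""
    return sum(abs(f(x)) for x in surface_points) / len(surface_points)

def scaled(p):
    """Same zero-level set as the sphere, but not a distance field (gradient norm is 2)."""
    return 2.0 * sphere_sdf(p)

off_surface = [0.3, -1.2, 0.8]
on_surface = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.6, 0.0, 0.8]]

print(eikonal_residual(sphere_sdf, off_surface))  # close to 0
print(eikonal_residual(scaled, off_surface))      # close to 1
print(surface_residual(sphere_sdf, on_surface))   # close to 0
```

The `scaled` field illustrates why surface samples alone are not enough: it is zero on the same surface, yet only the eikonal term distinguishes it from a true distance field.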

Paper Structure

This paper contains 23 sections, 7 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: The reconstruction of a 3D scan of a beehive (left) and comparisons of input points and our respective reconstructions on the Objaverse dataset (Deitke et al., 2023) (right). These examples show the capability of our model to encode a large amount of detail in a single forward pass, using only oriented point clouds as input.
  • Figure 2: Convolution block that extracts features for a specific grid resolution. For clarity of illustration, a 2D rather than 3D grid is shown here. The input is a set of vertices (with position / feature data) and edges (encoded as vertex indices). The + denotes element-wise vector addition. The block has two outputs, feature values on the vertices and grid values for each grid cell. For all resolutions, a $2\times2$ convolution kernel is used. $n$: number of vertices. $f$: number of features (on the first level, the features are the spatial coordinate of each vertex).
  • Figure 3: The encoder-decoder architecture of our network. The encoder computes vertex and volumetric features at multiple resolutions. By passing the feature vector through the convolution blocks, neighbor information is collected. The implementation of the convolution blocks is shown in Figure 2. After the last block, the vertex feature vector is discarded. The + denotes element-wise vector addition. $n$: number of vertices. $f$: number of features. $s$: number of SDF sample points.
  • Figure 4: From top to bottom, the size of all latent grids increases; from left to right, the size of the latent feature increases. Below each figure, the inference time for 17M points ($256^3$ regular grid) and the storage requirements for the latent grid are shown. The numbers of trainable network parameters for feature sizes 16, 32, and 64 are 38K, 151K, and 604K, respectively, regardless of the specific grid sizes.
  • Figure 5: Comparing our method to related baselines. Please refer to the accompanying video for a more immersive comparison.
  • ...and 1 more figure
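The "17M points" quoted in the Figure 4 caption follows directly from the $256^3$ regular query grid. The short sketch below reproduces that count and shows how the storage of a dense latent grid scales with resolution and feature size. The storage model is an assumption (a single dense grid of 32-bit floats), the example resolution of 32 is hypothetical, and both function names are made up for illustration; the paper's actual grid sizes and storage figures are not reproduced here.

```python
def regular_grid_point_count(resolution, dims=3):
    """Number of sample points in a regular grid with `resolution` points per axis."""
    return resolution ** dims

def latent_grid_bytes(resolution, feature_size, bytes_per_value=4, dims=3):
    """Storage for one dense latent grid, assuming float32 values (an assumption)."""
    return regular_grid_point_count(resolution, dims) * feature_size * bytes_per_value

# Query grid from the Figure 4 caption: 256^3 points, roughly 17M.
n_query = regular_grid_point_count(256)
print(n_query)  # 16777216

# Hypothetical latent grid: 32^3 cells with 32 features each.
print(latent_grid_bytes(32, 32) / 2**20)  # 4.0 (MiB)
```

Doubling the resolution multiplies storage by eight, while doubling the feature size only doubles it, which is the trade-off the rows and columns of Figure 4 explore.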