Challenges in 3D Data Synthesis for Training Neural Networks on Topological Features
Dylan Peek, Matthew P. Skerritt, Siddharth Pritam, Stephan Chalup
TL;DR
This work addresses the scarcity of labeled 3D data for supervised Topological Data Analysis by introducing a synthetic data pipeline that controllably encodes topology through genus $g$ and Betti number $\beta_1$ using Repulsive Surfaces. It presents the RG Repulse dataset, built from seeds with $\beta_1 \in [0, 20]$ that are grown in random grid environments, deformed while preserving topology, and voxelized to $256^3$ with Perlin-noise augmentation. This enables $8{,}192$-point inputs for a 3D Convolutional Transformer network trained to predict $\beta_1$. Experimental results show that accuracy decreases as geometric complexity increases, highlighting the separate roles of topology and geometry in estimator generalization. The dataset provides a flexible platform for training and benchmarking topology-aware estimators and persistent homology pipelines, with potential for transfer learning to real-world voxel data in domains such as medical imaging and materials science.
Abstract
Topological Data Analysis (TDA) comprises techniques for analyzing the underlying structure and connectivity of data. However, traditional methods such as persistent homology can be computationally demanding, motivating the development of neural network-based estimators that reduce computational overhead and inference time. A key barrier to advancing these methods is the lack of labeled 3D data with class distributions and diversity tailored specifically to supervised learning in TDA tasks. To address this, we introduce a novel approach for systematically generating labeled 3D datasets using the Repulsive Surface algorithm, which allows control over topological invariants such as hole count. The resulting dataset offers varied geometry with topological labeling, making it suitable for training and benchmarking neural network estimators. We use this synthetic 3D dataset to train a genus estimator network built on a 3D convolutional transformer architecture. An observed decrease in accuracy as deformations increase shows that geometric complexity, and not only topological complexity, matters when training generalized estimators. This dataset fills a gap in labeled 3D datasets, and in methods for generating them, for training and evaluating TDA models and techniques.
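As an illustration of what a hole-count label means for voxel data, the sketch below recovers $\beta_1$ of a small voxel solid from its Euler characteristic $\chi = \beta_0 - \beta_1 + \beta_2$. It is not the labeling procedure used for the dataset, where the label is fixed by construction of the seed; it is only a minimal independent check. The sketch assumes that the solid is a single connected component with no enclosed cavities ($\beta_0 = 1$, $\beta_2 = 0$) and that voxels are treated as closed cubes, so voxels touching at a corner count as connected. The helper names `euler_characteristic`, `beta1_of_solid`, and `slab_with_holes` are hypothetical.

```python
from itertools import product


def euler_characteristic(voxels):
    """Euler characteristic of the closed cubical complex spanned by voxels.

    `voxels` is an iterable of integer (i, j, k) indices of occupied voxels.
    """
    cells = set()
    for i, j, k in voxels:
        # Doubled coordinates: the voxel itself sits at all-odd coordinates,
        # and each offset in {-1, 0, 1}^3 addresses one of the 27 cells
        # (vertices, edges, faces, and the cube) that make up its closure.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            cells.add((2 * i + 1 + dx, 2 * j + 1 + dy, 2 * k + 1 + dz))
    # A cell's dimension equals its number of odd coordinates.
    return sum((-1) ** sum(c % 2 for c in cell) for cell in cells)


def beta1_of_solid(voxels):
    """beta_1 under the stated assumption beta_0 = 1 and beta_2 = 0."""
    return 1 - euler_characteristic(voxels)


def slab_with_holes(width, holes):
    """One-voxel-thick slab of size width x 3 with the given (i, j) cells removed."""
    return [(i, j, 0) for i in range(width) for j in range(3)
            if (i, j) not in holes]


# Assumed toy shapes: a ring (one hole) and a double ring (two holes).
ring = slab_with_holes(3, {(1, 1)})
double_ring = slab_with_holes(5, {(1, 1), (3, 1)})
print(beta1_of_solid(ring), beta1_of_solid(double_ring))  # prints "1 2"
```

The set-based cell enumeration is chosen for readability and suits only small grids; at $256^3$ resolution an array-based count or a dedicated cubical homology library would be the more practical choice, and shapes with several components or internal cavities would need $\beta_0$ and $\beta_2$ computed separately.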
