Topology-based Representative Datasets to Reduce Neural Network Training Resources
Rocio Gonzalez-Diaz, Miguel A. Gutiérrez-Naranjo, Eduardo Paluzo-Hidalgo
TL;DR
The paper tackles the challenge of lengthy neural network training by introducing ε-representative datasets: smaller datasets that preserve the original data's topological structure. Representativeness is quantified through topological tools, notably persistence diagrams and the bottleneck distance, with the Gromov-Hausdorff distance serving as a theoretical bound. The authors prove that for a binary perceptron trained with mean squared error, accuracy on a representative dataset is similar to that on the full dataset under suitable ε, and they support this claim with experiments on synthetic data, the Iris dataset, and multi-layer networks. Together, the results suggest that topologically informed data reduction can shorten training while retaining predictive performance, motivating broader application to other architectures and training regimes.
Abstract
One of the main drawbacks of the practical use of neural networks is the long time required in the training process. Such a training process consists of an iterative change of parameters trying to minimize a loss function. These changes are driven by a dataset, which can be seen as a set of labelled points in an n-dimensional space. In this paper, we explore the concept of a representative dataset, which is a dataset smaller than the original one, satisfying a nearness condition independent of isometric transformations. Representativeness is measured using persistence diagrams (a computational topology tool) because of their computational efficiency. We prove that the accuracy of the learning process of a neural network on a representative dataset is "similar" to the accuracy on the original dataset when the neural network architecture is a perceptron and the loss function is the mean squared error. These theoretical results, accompanied by experimentation, open a door to reducing the size of the dataset to gain time in the training process of any neural network.
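To make the nearness condition concrete, the following is a minimal sketch rather than the authors' method. It assumes the condition means that every point of the original dataset has a same-labelled point of the smaller dataset within distance ε. It ignores the isometric transformation mentioned in the abstract and does not compute persistence diagrams. The function name `representativeness_eps` and the toy data are hypothetical.

```python
import math

def representativeness_eps(original, subset):
    """Smallest eps such that every (point, label) in `original` has a
    same-labelled point of `subset` within Euclidean distance eps.

    Assumption: this is a simplified reading of the nearness condition;
    no isometric transformation is searched for.
    Returns math.inf if some label of `original` is absent from `subset`.
    """
    eps = 0.0
    for x, label in original:
        # Distances from x to every candidate with the same label.
        dists = [math.dist(x, y) for y, lab in subset if lab == label]
        if not dists:
            return math.inf
        eps = max(eps, min(dists))
    return eps

# Toy data: two classes, each with two points in the plane.
original = [((0.0, 0.0), 0), ((0.0, 1.0), 0),
            ((3.0, 0.0), 1), ((3.0, 1.0), 1)]
# Candidate reduced dataset: one point per class.
subset = [((0.0, 0.0), 0), ((3.0, 0.0), 1)]

eps = representativeness_eps(original, subset)
print(eps)  # 1.0
```

Under this simplified reading, the reduced dataset above is ε-representative of the original for any ε ≥ 1. The brute-force comparison costs one distance computation per pair of points, which is presumably why a cheaper proxy such as persistence diagrams is attractive for larger datasets.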
