Topology-based Representative Datasets to Reduce Neural Network Training Resources

Rocio Gonzalez-Diaz, Miguel A. Gutiérrez-Naranjo, Eduardo Paluzo-Hidalgo

TL;DR

The paper addresses the long training times of neural networks by introducing ε-representative datasets: smaller datasets that stay close to the original one, with closeness defined independently of isometric transformations. Representativeness is measured with persistence diagrams and the bottleneck distance, with the Gromov-Hausdorff distance providing a theoretical upper bound on the bottleneck distance. The authors prove that, for a binary perceptron trained with mean squared error, accuracy on a representative dataset is similar to accuracy on the full dataset for a suitable ε, and they support the result with experiments on synthetic data, the Iris dataset, and multi-layer networks. The results suggest that topology-informed data reduction can shorten training while retaining predictive performance, which motivates extending the approach to other architectures and training regimes.

Abstract

One of the main drawbacks of the practical use of neural networks is the long time required in the training process. Such a training process consists of an iterative change of parameters trying to minimize a loss function. These changes are driven by a dataset, which can be seen as a set of labelled points in an n-dimensional space. In this paper, we explore the concept of a representative dataset, which is a dataset smaller than the original one that satisfies a nearness condition independent of isometric transformations. Representativeness is measured using persistence diagrams (a computational topology tool) due to their computational efficiency. We prove that the accuracy of the learning process of a neural network on a representative dataset is "similar" to the accuracy on the original dataset when the neural network architecture is a perceptron and the loss function is the mean squared error. These theoretical results, accompanied by experimentation, open a door to reducing the size of the dataset to gain time in the training process of any neural network.
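The nearness condition can be made concrete with a small sketch. The formal definition is not part of this excerpt, so the code below assumes one plausible reading: a subset is ε-representative when every labelled point of the original dataset has a point with the same label in the subset at Euclidean distance at most ε. The isometric transformation is taken to be the identity, and the function name `smallest_epsilon` is hypothetical.

```python
import numpy as np

def smallest_epsilon(X, y, X_sub, y_sub):
    """Smallest eps such that every labelled point of (X, y) has a point of
    (X_sub, y_sub) with the same label within Euclidean distance eps.
    Assumption: the isometry between the datasets is the identity."""
    eps = 0.0
    for label in np.unique(y):
        pts = X[y == label]
        reps = X_sub[y_sub == label]
        if len(reps) == 0:
            return np.inf  # a class without representatives cannot be covered
        # Pairwise distances, shape (n_points, n_representatives).
        d = np.linalg.norm(pts[:, None, :] - reps[None, :, :], axis=2)
        # Worst case over points of the distance to the nearest representative.
        eps = max(eps, d.min(axis=1).max())
    return eps

# Toy data: a sampled circle, labelled by upper (1) versus lower (0) half.
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])
y = (X[:, 1] >= 0).astype(int)

# Candidate subset: every 10th sample.
X_sub, y_sub = X[::10], y[::10]
eps = smallest_epsilon(X, y, X_sub, y_sub)
print(f"{len(X_sub)} points are eps-representative for eps >= {eps:.3f}")
```

A smaller returned value means the subset follows the original dataset more closely; a class with no representative yields infinity.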

Paper Structure

This paper contains 15 sections, 13 theorems, 39 equations, 21 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

(Chazal2014) For any two subsets $X$ and $Y$ of $\mathbb{R}^n$, and for any dimension $q\leq n$, the bottleneck distance between the persistence diagrams of $X$ and $Y$, $\hbox{Dgm}_q(X)$ and $\hbox{Dgm}_q(Y)$, is bounded by the Gromov-Hausdorff distance between $X$ and $Y$:
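The displayed inequality is not included in this excerpt. As a hedged reconstruction, stability results of this kind for Vietoris-Rips filtrations are usually stated with a factor of 2. That constant is assumed here, and the symbols $d_B$ (bottleneck distance) and $d_{GH}$ (Gromov-Hausdorff distance) are introduced only for illustration:

```latex
d_B\bigl(\hbox{Dgm}_q(X), \hbox{Dgm}_q(Y)\bigr) \;\leq\; 2\, d_{GH}(X, Y)
```

Read this way, two datasets that are close in the Gromov-Hausdorff sense must have close persistence diagrams. Equivalently, a large bottleneck distance rules out a small Gromov-Hausdorff distance.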

Figures (21)

  • Figure 1: A point cloud sampling two interlaced solid tori and the $\varepsilon$-proximity graph of one of them for a fixed $\varepsilon$.
  • Figure 2: A binary classification problem given by a sampled circumference. In this case, the classification problem tries to distinguish between the upper and the lower part of the circumference.
  • Figure 3: ($\varepsilon_1$-Representative dataset) A subset of the sampled circumference given in Fig. 2. Note that the decision boundary obtained is similar to the one shown in Fig. 2.
  • Figure 4: ($\varepsilon_2$-Representative dataset) A subset of the sampled circumference given in Fig. 2. Note that the decision boundary obtained is quite different from the one shown in Fig. 2.
  • Figure 6: Different synthetic datasets generated using the Scikit-learn Python package. The first column shows the original datasets, the second column shows dominating datasets of the original datasets, and the third column shows random subsets of the original datasets with the same size as the corresponding dominating set.
  • ...and 16 more figures
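
The comparison described in the Figure 6 caption, a dominating dataset against a random subset of the same size, can be sketched as follows. The paper's own procedure for building dominating datasets is not shown in this excerpt, so a simple greedy selection is swapped in here. The data are two Gaussian blobs drawn with NumPy rather than the Scikit-learn generators, and the names `greedy_dominating_subset` and `coverage_radius` are hypothetical.

```python
import numpy as np

def greedy_dominating_subset(X, y, eps):
    """Indices of a subset such that every point has a kept point with the
    same label within distance eps. Greedy and order-dependent, so the
    result is not guaranteed to be of minimal size."""
    kept = []
    for i in range(len(X)):
        covered = any(
            y[j] == y[i] and np.linalg.norm(X[i] - X[j]) <= eps for j in kept
        )
        if not covered:
            kept.append(i)
    return np.array(kept)

def coverage_radius(X, y, idx):
    """Largest distance from any point to its nearest same-label kept point."""
    worst = 0.0
    for i in range(len(X)):
        same = idx[y[idx] == y[i]]
        if len(same) == 0:
            return np.inf  # this label has no kept point at all
        worst = max(worst, np.linalg.norm(X[same] - X[i], axis=1).min())
    return worst

rng = np.random.default_rng(0)
# Assumed toy data: two labelled Gaussian blobs in the plane.
X = np.vstack([rng.normal(0.0, 1.0, (150, 2)), rng.normal(4.0, 1.0, (150, 2))])
y = np.repeat([0, 1], 150)

eps = 0.8
dom = greedy_dominating_subset(X, y, eps)
rand = rng.choice(len(X), size=len(dom), replace=False)

print("subset size:", len(dom))
print("dominating coverage radius:", coverage_radius(X, y, dom))
print("random coverage radius:", coverage_radius(X, y, rand))
```

By construction the greedy subset has a coverage radius of at most `eps`. The random subset carries no such guarantee, which is the contrast the figure's second and third columns illustrate.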

Theorems & Definitions (31)

  • Definition 1
  • Definition 2
  • Definition 3
  • Theorem 1
  • Definition 4
  • Definition 5
  • Proposition 1
  • proof
  • Corollary 1
  • Definition 6
  • ...and 21 more