Graph Semi-Supervised Learning for Point Classification on Data Manifolds
Caio F. Deberaldini Netto, Zhiyang Wang, Luana Ruiz
TL;DR
This work addresses point classification on data manifolds in three steps: a variational autoencoder (VAE) embeds the data to approximate the manifold $\mathcal{M}$, a geometric graph with Gaussian-weighted edges is built in the embedding space, and a graph neural network (GNN) performs semi-supervised node classification on that graph. The authors develop a theoretical framework showing that the semi-supervised generalization gap on geometric graphs shrinks as the graph size grows, up to the GNN training error, and they introduce a growing-graph training scheme under which the gap vanishes asymptotically. Empirically, the method performs well on image classification benchmarks (e.g., MNIST, FMNIST, CIFAR10, FER2013, CelebA, PathMNIST), training on a sequence of increasingly large graphs reduces the generalization gap, and VAE embeddings provide better latent structure than PCA baselines. Overall, the paper offers a principled, geometry-aware pipeline that combines VAE embeddings, geometric graph construction, and GNNs to improve generalization in high-dimensional classification tasks.
Abstract
We propose a graph semi-supervised learning framework for classification tasks on data manifolds. Motivated by the manifold hypothesis, we model data as points sampled from a low-dimensional manifold $\mathcal{M} \subset \mathbb{R}^F$. The manifold is approximated in an unsupervised manner using a variational autoencoder (VAE), where the trained encoder maps data to embeddings that represent their coordinates in $\mathbb{R}^F$. A geometric graph is then constructed whose edges are weighted by a Gaussian kernel, so that edge weights decay with distance in the embedding space. This transforms the point classification problem into a semi-supervised node classification task on the graph, which is solved using a graph neural network (GNN). Our main contribution is a theoretical analysis of the statistical generalization properties of this data-to-manifold-to-graph pipeline. We show that, under uniform sampling from $\mathcal{M}$, the generalization gap of the semi-supervised task diminishes with increasing graph size, up to the GNN training error. Leveraging a training procedure that resamples a slightly larger graph at regular intervals during training, we then show that the generalization gap can be reduced even further, vanishing asymptotically. Finally, we validate our findings with numerical experiments on image classification benchmarks, demonstrating the empirical effectiveness of our approach.
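The graph-construction step and the growing-graph schedule described above can be illustrated with a minimal sketch. It is not the authors' implementation. The kernel form $\exp(-\|z_i - z_j\|^2 / (2\sigma^2))$, the bandwidth `sigma`, the optional `threshold`, and the nested-subset schedule in `growing_node_subsets` are all assumptions made for illustration, and the embeddings `z` stand in for the output of a trained VAE encoder.

```python
import numpy as np


def gaussian_geometric_graph(z, sigma=1.0, threshold=0.0):
    """Dense weighted adjacency matrix from embeddings z of shape (n, d).

    Assumed kernel: w_ij = exp(-||z_i - z_j||^2 / (2 * sigma^2)).
    Weights below `threshold` are zeroed (hypothetical sparsification).
    """
    z = np.asarray(z, dtype=float)
    sq_norms = np.sum(z ** 2, axis=1)
    # Pairwise squared Euclidean distances via broadcasting.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (z @ z.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negative round-off
    w = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)  # no self-loops
    w[w < threshold] = 0.0
    return w


def growing_node_subsets(n_total, n_start, step, rng):
    """Yield index sets of increasing size drawn from one fixed permutation.

    Assumption: each larger graph contains the previous one (nested subsets).
    """
    perm = rng.permutation(n_total)
    for n in range(n_start, n_total + 1, step):
        yield perm[:n]


# Usage: random stand-in embeddings, one graph per stage of the schedule.
rng = np.random.default_rng(0)
z = rng.normal(size=(10, 2))
for idx in growing_node_subsets(n_total=10, n_start=4, step=2, rng=rng):
    adjacency = gaussian_geometric_graph(z[idx], sigma=1.0)
    # A GNN would be trained on `adjacency` for some epochs at this point.
    print(adjacency.shape)
```

The dense $n \times n$ matrix keeps the sketch short. For large graphs a sparse nearest-neighbor construction would likely be preferable, though which variant the paper uses is not stated in this section.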
