Graph Semi-Supervised Learning for Point Classification on Data Manifolds

Caio F. Deberaldini Netto, Zhiyang Wang, Luana Ruiz

TL;DR

This work addresses point classification on data manifolds by embedding data with a variational autoencoder to approximate the manifold ${\mathcal M}$, constructing a Gaussian-geometric graph in embedding space, and applying a graph neural network (GNN) for semi-supervised node classification. The authors develop a theoretical framework showing that the semi-supervised generalization gap on geometric graphs shrinks as the graph size grows, and they introduce a growing-graph training scheme to drive the gap toward zero asymptotically, connecting graph-based learning with manifold geometry. Empirically, the method yields strong image classification performance across benchmarks (e.g., MNIST, FMNIST, CIFAR10, FER2013, CelebA, PathMNIST) and demonstrates reduced generalization gaps when training on sequences of increasing graphs, while VAEs provide superior latent structure versus PCA baselines. Overall, the paper offers a principled, geometry-aware pipeline that combines VAE embeddings, geometric graph construction, and GNNs to enhance generalization in high-dimensional classification tasks.

Abstract

We propose a graph semi-supervised learning framework for classification tasks on data manifolds. Motivated by the manifold hypothesis, we model data as points sampled from a low-dimensional manifold $\mathcal{M} \subset \mathbb{R}^F$. The manifold is approximated in an unsupervised manner using a variational autoencoder (VAE), where the trained encoder maps data to embeddings that represent their coordinates in $\mathbb{R}^F$. A geometric graph is constructed with Gaussian-weighted edges inversely proportional to distances in the embedding space, transforming the point classification problem into a semi-supervised node classification task on the graph. This task is solved using a graph neural network (GNN). Our main contribution is a theoretical analysis of the statistical generalization properties of this data-to-manifold-to-graph pipeline. We show that, under uniform sampling from $\mathcal{M}$, the generalization gap of the semi-supervised task diminishes with increasing graph size, up to the GNN training error. Leveraging a training procedure which resamples a slightly larger graph at regular intervals during training, we then show that the generalization gap can be reduced even further, vanishing asymptotically. Finally, we validate our findings with numerical experiments on image classification benchmarks, demonstrating the empirical effectiveness of our approach.
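The graph-construction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the kernel form $\exp(-\|z_i - z_j\|^2 / (2\sigma^2))$, the `bandwidth` parameter, the dense adjacency, and the unnormalized Laplacian are all assumptions, and random vectors stand in for the VAE encoder outputs.

```python
import numpy as np

def gaussian_graph(Z, bandwidth=1.0):
    """Dense Gaussian-weighted adjacency for embeddings Z of shape (n, d).

    Assumed kernel: w_ij = exp(-||z_i - z_j||^2 / (2 * bandwidth^2)).
    """
    # Squared Euclidean distances between all pairs of embeddings.
    sq_norms = np.sum(Z**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Z @ Z.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    W = np.exp(-sq_dists / (2.0 * bandwidth**2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W (an assumed normalization)."""
    return np.diag(W.sum(axis=1)) - W

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 2))  # hypothetical stand-in for VAE embeddings
W = gaussian_graph(Z, bandwidth=1.0)
L = graph_laplacian(W)
```

With this kernel, edge weights decay smoothly with embedding distance, so nearby points are strongly connected and distant points are nearly disconnected. A GNN for node classification would then take `W` (or `L`) together with the embeddings as node features.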

Paper Structure

This paper contains 20 sections, 11 theorems, 70 equations, 3 figures, 5 tables.

Key Result

Proposition 3.1

Let $\Phi_{\mathcal{W}}$ be an MNN on the $d$-dimensional manifold ${\mathcal{M}}$. Let $\{u_1,\ldots,u_n\}$ be a set of points sampled uniformly from ${\mathcal{M}}$ and $L_n$ the corresponding geometric graph Laplacian. Define the map ${\mathcal{P}}_n: {\mathcal{X}} \mapsto X_n$: Suppose Assumptions ass:lipschitz_filter_main_body--ass:lipschitz_act_fn_main_body (stated in Section sec:main_matte

Figures (3)

  • Figure 1: (a) Framework schematic. We start by constructing VAE embeddings (1), computing their pairwise distances to form manifolds (2), and sampling graphs from the manifolds (3). GNNs are trained on these graphs to leverage geometric information for image classification (4). (b) Setup for Theorems \ref{thm:ga_bound_1}--\ref{thm:MNN_learning} and Corollary \ref{thm:main}.
  • Figure 2: Generalization gap relative to training accuracy for (a) MNIST, (b) FMNIST, (c) CIFAR10, (d) FER2013. We compare an MLP trained on the VAE embeddings of the full dataset (red); GNNs fully trained on subgraphs of the full data graph with size given by the $x$-axis (blue, Thm. \ref{thm:ga_bound_2}); and a GNN learned on this sequence of subgraphs, one per epoch (green, Cor. \ref{thm:main}). The generalization gap decreases with graph size (blue) and is substantially smaller when training on growing subgraphs (green), in line with our theoretical predictions.
  • Figure 3: Generalization gap relative to training accuracy for (a) CelebA-Smiling, (b) CelebA-Gender, (c) PathMNIST. We compare an MLP trained on the VAE embeddings of the full dataset (red); GNNs fully trained on subgraphs of the full data graph with size given by the $x$-axis (blue, Thm. \ref{thm:ga_bound_2}); and a GNN learned on this sequence of subgraphs, one per epoch (green, Cor. \ref{thm:main}). The generalization gap decreases with graph size (blue) and is substantially smaller when training on growing subgraphs (green), in line with our theoretical predictions.
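The "sequence of subgraphs, one per epoch" mentioned in the captions of Figures 2 and 3 can be illustrated with a short sketch. The details here are assumptions rather than the paper's procedure: the subgraphs are taken to be nested, they grow by a fixed `step`, and the helper name `growing_subgraphs` is hypothetical. The GNN update that would run on each subgraph is omitted.

```python
import numpy as np

def growing_subgraphs(W, start, step, rng):
    """Yield (node_indices, sub_adjacency) for nested, growing node subsets.

    Assumption: each subgraph contains the previous one; the source only
    states that a larger subgraph is used as training progresses.
    """
    n = W.shape[0]
    order = rng.permutation(n)  # a fixed random order makes the subsets nested
    size = start
    while size <= n:
        idx = order[:size]
        yield idx, W[np.ix_(idx, idx)]  # induced subgraph on the chosen nodes
        size += step

rng = np.random.default_rng(0)
Z = rng.normal(size=(10, 2))  # hypothetical stand-in for VAE embeddings
sq_dists = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dists / 2.0)  # Gaussian weights with unit bandwidth (assumed)
np.fill_diagonal(W, 0.0)

subgraphs = list(growing_subgraphs(W, start=4, step=2, rng=rng))
sizes = [sub.shape[0] for _, sub in subgraphs]  # [4, 6, 8, 10]
```

In a training loop, one epoch of GNN updates would run on each yielded subgraph in turn, so the model sees a progressively larger sample of the data graph.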

Theorems & Definitions (20)

  • Proposition 3.1: wang2024manifold, simplified
  • Theorem 4.4: An unsatisfactory generalization bound
  • proof
  • Theorem 4.5: A satisfactory generalization bound
  • proof
  • Theorem 4.6
  • proof
  • Corollary 4.7: A better generalization bound
  • Lemma B.3
  • proof
  • ...and 10 more