Minimally Supervised Learning using Topological Projections in Self-Organizing Maps
Zimeng Lyu, Alexander Ororbia, Rui Li, Travis Desell
TL;DR
The paper addresses parameter prediction when labeled data are scarce by introducing a minimally supervised framework based on self-organizing maps (SOMs). It trains SOMs on large unlabeled datasets, maps a small labeled set to best matching units (BMUs), and predicts values for unseen data using topological distances and neighbor-based projections, notably a weighted-average approach computed via $e_p = \frac{\sum_{n} p_n \, \frac{1}{d(BMU,n)}}{\sum_{n} \frac{1}{d(BMU,n)}}$, where the sums run over the labeled neighbors $n$, $p_n$ is the parameter value at neighbor $n$, and $d(BMU,n)$ is the distance from the query's BMU to that neighbor. Across coal spectra and appliance-energy datasets, the SOM-based topological projection, especially the weighted-average variant, consistently outperforms classical regression, Gaussian process regression, deep neural networks, K-nearest neighbors (KNN), and DBSCAN, particularly when labels are scarce. The method performs strongly in high-dimensional settings, offers a visualizable topology through the U-matrix, and is of practical value in domains where labeling is expensive, suggesting broad applicability and avenues for explainability and further topology-based projections.
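As a rough illustration of the weighted-average projection summarized above, the following sketch computes an inverse-distance weighted mean of neighbor labels. The function name `weighted_average_projection`, the `eps` tolerance, and the handling of zero distances (returning the mean of the exactly matching labels) are assumptions made for this example, not details taken from the paper.

```python
def weighted_average_projection(labels, distances, eps=1e-12):
    """Inverse-distance weighted average of neighbor labels.

    labels[i] is the known parameter value of the i-th labeled neighbor and
    distances[i] is the distance from the query's BMU to that neighbor.
    Both names are hypothetical and chosen only for this sketch.
    """
    if len(labels) != len(distances) or len(labels) == 0:
        raise ValueError("labels and distances must be non-empty and equal in length")

    # Assumption: a neighbor at distance ~0 shares the query's BMU, so its
    # label is used directly (averaged if several such neighbors exist).
    exact = [p for p, d in zip(labels, distances) if d <= eps]
    if exact:
        return sum(exact) / len(exact)

    # Each neighbor contributes with weight 1 / d(BMU, n).
    weights = [1.0 / d for d in distances]
    return sum(w * p for w, p in zip(weights, labels)) / sum(weights)
```

For instance, with labels `[10.0, 20.0]` at distances `[1.0, 3.0]` the weights are 1 and 1/3, so the estimate is (10 + 20/3) / (4/3) = 12.5, pulled toward the closer neighbor.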
Abstract
Parameter prediction is essential for many applications, facilitating insightful interpretation and decision-making. However, in many real-life domains, such as power systems, medicine, and engineering, acquiring ground truth labels can be very expensive, as it may require extensive and costly laboratory testing. In this work, we introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs), which significantly reduces the number of labeled data points required for parameter prediction by effectively exploiting the information contained in large unlabeled datasets. Our proposed method first trains SOMs on unlabeled data and then assigns a minimal number of available labeled data points to key best matching units (BMUs). The values estimated for newly encountered data points are computed as the average of the $n$ closest labeled data points in the SOM's U-matrix, with closeness determined by a topological shortest-path distance calculation scheme. Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques, including linear and polynomial regression, Gaussian process regression, and K-nearest neighbors, as well as deep neural network models and related clustering schemes.
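The prediction step described above can be sketched as follows, under several assumptions that this section does not confirm: the SOM is already trained and stored as a rectangular grid `weights[r][c]` of weight vectors, each unit is connected to its four grid neighbors, and the cost of a step between adjacent units is the Euclidean distance between their weight vectors (the quantity a U-matrix visualizes). The names `find_bmu` and `predict` are hypothetical, and Dijkstra's algorithm stands in for whatever shortest-path scheme the authors use.

```python
import heapq
import math


def find_bmu(weights, x):
    """Return the (row, col) of the unit whose weight vector is closest to x."""
    best_unit, best_dist = None, math.inf
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = math.dist(w, x)
            if d < best_dist:
                best_unit, best_dist = (r, c), d
    return best_unit


def predict(weights, labeled_x, labeled_y, query, n_neighbors=3):
    """Average the labels of the n labeled points closest to the query's BMU,
    where closeness is the shortest-path distance over the SOM grid."""
    rows, cols = len(weights), len(weights[0])

    # Assign each labeled sample to its BMU.
    labels_at = {}
    for x, y in zip(labeled_x, labeled_y):
        labels_at.setdefault(find_bmu(weights, x), []).append(y)
    if not labels_at:
        raise ValueError("at least one labeled sample is required")

    # Dijkstra from the query's BMU; units are popped in order of increasing
    # path length, so labels are collected nearest-first.
    start = find_bmu(weights, query)
    best = {start: 0.0}
    heap = [(0.0, start)]
    visited = set()
    found = []
    while heap and len(found) < n_neighbors:
        dist, unit = heapq.heappop(heap)
        if unit in visited:
            continue
        visited.add(unit)
        found.extend(labels_at.get(unit, []))
        r, c = unit
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                # Assumed edge cost: distance between adjacent weight vectors.
                nd = dist + math.dist(weights[r][c], weights[nr][nc])
                if nd < best.get((nr, nc), math.inf):
                    best[(nr, nc)] = nd
                    heapq.heappush(heap, (nd, (nr, nc)))

    found = found[:n_neighbors]
    return sum(found) / len(found)
```

Measuring distance along the grid rather than directly in input space means that two units separated by a high U-matrix ridge count as far apart even when they sit close together on the map, which is the intuition behind using the map's topology for projection.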
