Bridged Clustering for Representation Learning: Semi-Supervised Sparse Bridging
Patrick Peixuan Ye, Chen Shani, Ellen Vitercik
TL;DR
Bridged Clustering tackles semi-supervised prediction when inputs $\mathcal{X}$ and outputs $\mathcal{Y}$ are available separately and only a small set of paired examples exists. It clusters each modality independently and learns a sparse, cluster-level bridge that maps input clusters to output clusters. To predict, it assigns a new $x$ to its input cluster and returns the centroid of the linked $\mathcal{Y}$ cluster. Theoretical guarantees connect the mis-clustering rates $\varepsilon_X,\varepsilon_Y$ and the mis-bridging rate $\varepsilon_B$ to overall risk, while empirical results across vision, language, and bioinformatics domains show competitive performance with strong label efficiency and linear-time inference. The approach emphasizes interpretability and efficiency, and it makes use of large volumes of unpaired observations without dense supervision or heavy generative models.
Abstract
We introduce Bridged Clustering, a semi-supervised framework for learning predictors from any unpaired input $X$ and output $Y$ datasets. Our method first clusters $X$ and $Y$ independently, then learns a sparse, interpretable bridge between clusters using only a few paired examples. At inference, a new input $x$ is assigned to its nearest input cluster, and the centroid of the linked output cluster is returned as the prediction $\hat{y}$. Unlike traditional semi-supervised learning (SSL), Bridged Clustering explicitly leverages output-only data, and unlike dense transport-based methods, it maintains a sparse and interpretable alignment. Through theoretical analysis, we show that when the mis-clustering and mis-bridging rates are bounded, our algorithm is an effective and efficient predictor. Empirically, our method is competitive with state-of-the-art (SOTA) methods while remaining simple, model-agnostic, and highly label-efficient in low-supervision settings.
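The three steps described above (cluster each side, bridge the clusters with a few pairs, predict a centroid) can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: plain k-means as the clustering routine, a majority vote over paired examples as the bridge, and synthetic Gaussian data. The function names (`kmeans`, `assign`, `fit_bridge`, `predict`) are hypothetical and chosen only for this illustration.

```python
import numpy as np

def assign(data, centroids):
    """Index of the nearest centroid for each row of `data`."""
    d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def kmeans(data, k, n_iter=50):
    """Lloyd's k-means with farthest-point seeding (an assumed, deterministic choice)."""
    centroids = [data[0]]
    for _ in range(1, k):
        # Next seed: the point farthest from all seeds chosen so far.
        dists = np.min([np.linalg.norm(data - c, axis=1) for c in centroids], axis=0)
        centroids.append(data[np.argmax(dists)])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        labels = assign(data, centroids)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return centroids, assign(data, centroids)

def fit_bridge(x_centroids, y_centroids, x_paired, y_paired):
    """Link each input cluster to one output cluster by majority vote over the pairs.

    An input cluster that receives no paired example defaults to output cluster 0.
    """
    votes = np.zeros((len(x_centroids), len(y_centroids)), dtype=int)
    x_ids = assign(x_paired, x_centroids)
    y_ids = assign(y_paired, y_centroids)
    for cx, cy in zip(x_ids, y_ids):
        votes[cx, cy] += 1
    return votes.argmax(axis=1)

def predict(x_new, x_centroids, y_centroids, bridge):
    """Nearest input cluster, then the centroid of its linked output cluster."""
    return y_centroids[bridge[assign(x_new, x_centroids)]]

# Synthetic data (assumed for illustration): three latent groups.
rng = np.random.default_rng(0)
x_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
y_centers = np.array([[100.0], [200.0], [300.0]])
# Unpaired pools: input-only and output-only samples drawn independently.
x_pool = np.concatenate([c + 0.5 * rng.standard_normal((40, 2)) for c in x_centers])
y_pool = np.concatenate([c + 0.5 * rng.standard_normal((40, 1)) for c in y_centers])
# Two paired examples per latent group.
x_pairs = np.concatenate([c + 0.5 * rng.standard_normal((2, 2)) for c in x_centers])
y_pairs = np.concatenate([c + 0.5 * rng.standard_normal((2, 1)) for c in y_centers])

x_cent, _ = kmeans(x_pool, 3)
y_cent, _ = kmeans(y_pool, 3)
bridge = fit_bridge(x_cent, y_cent, x_pairs, y_pairs)
y_hat = predict(np.array([[9.5, 0.3]]), x_cent, y_cent, bridge)
```

In this sketch the bridge is a single integer per input cluster, which is what makes the alignment sparse and easy to inspect. Prediction costs one nearest-centroid lookup and one table lookup per query.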
