Bridged Clustering for Representation Learning: Semi-Supervised Sparse Bridging

Patrick Peixuan Ye, Chen Shani, Ellen Vitercik

TL;DR

Bridged Clustering tackles semi-supervised prediction when inputs $\mathcal{X}$ and outputs $\mathcal{Y}$ are available separately with only a small set of paired examples. It clusters each modality independently and learns a sparse, cluster-level bridge that maps input clusters to output clusters, enabling predictions by assigning a new $x$ to its input cluster and returning the centroid of the linked $\mathcal{Y}$ cluster. Theoretical guarantees connect mis-clustering rates $\varepsilon_X,\varepsilon_Y$ and mis-bridging rate $\varepsilon_B$ to overall risk, while empirical results across vision, language, and bioinformatics domains show competitive performance with strong label efficiency and linear-time inference. The approach emphasizes interpretability and efficiency, offering a practical avenue for leveraging large volumes of unpaired observations without dense supervision or heavy generative models.

Abstract

We introduce Bridged Clustering, a semi-supervised framework for learning predictors from unpaired input ($X$) and output ($Y$) datasets. Our method first clusters $X$ and $Y$ independently, then learns a sparse, interpretable bridge between clusters using only a few paired examples. At inference, a new input $x$ is assigned to its nearest input cluster, and the centroid of the linked output cluster is returned as the prediction $\hat{y}$. Unlike traditional SSL, Bridged Clustering explicitly leverages output-only data, and unlike dense transport-based methods, it maintains a sparse and interpretable alignment. Through theoretical analysis, we show that with bounded mis-clustering and mis-bridging rates, our algorithm becomes an effective and efficient predictor. Empirically, our method is competitive with SOTA methods while remaining simple, model-agnostic, and highly label-efficient in low-supervision settings.
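The three steps described above (cluster $X$ and $Y$ independently, bridge the clusters with a few pairs, predict with the linked centroid) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the choice of k-means, the farthest-point initialization, the majority-vote bridge, the synthetic data, and every function name (`distances`, `kmeans`, `fit_bridge`, `predict`) are assumptions made for illustration only.

```python
import numpy as np

def distances(points, centroids):
    """Euclidean distance from every point to every centroid, shape (n, k)."""
    return np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

def kmeans(data, k, n_iter=20):
    """Lloyd's k-means with farthest-point initialization (assumed clusterer)."""
    centroids = data[:1].astype(float)
    for _ in range(k - 1):
        # Next seed: the point farthest from all seeds chosen so far.
        far = distances(data, centroids).min(axis=1).argmax()
        centroids = np.vstack([centroids, data[far]])
    for _ in range(n_iter):
        labels = distances(data, centroids).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return centroids

def fit_bridge(x_centroids, y_centroids, x_paired, y_paired):
    """Map each input cluster to one output cluster by majority vote
    over the paired examples (the voting rule is an assumption)."""
    votes = np.zeros((len(x_centroids), len(y_centroids)), dtype=int)
    x_ids = distances(x_paired, x_centroids).argmin(axis=1)
    y_ids = distances(y_paired, y_centroids).argmin(axis=1)
    np.add.at(votes, (x_ids, y_ids), 1)
    # Input clusters that received no paired example fall back to cluster 0.
    return votes.argmax(axis=1)

def predict(x_new, x_centroids, y_centroids, bridge):
    """Assign x to its nearest input cluster, return the linked Y centroid."""
    x_ids = distances(x_new, x_centroids).argmin(axis=1)
    return y_centroids[bridge[x_ids]]

# Synthetic, well-separated data (hypothetical): three X clusters in 2-D,
# three Y clusters in 1-D, sampled independently so the pools are unpaired.
rng = np.random.default_rng(0)
x_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
y_centers = np.array([[0.0], [10.0], [20.0]])
x_pool = np.concatenate([c + 0.3 * rng.standard_normal((100, 2)) for c in x_centers])
y_pool = np.concatenate([c + 0.3 * rng.standard_normal((100, 1)) for c in y_centers])

x_centroids = kmeans(x_pool, k=3)
y_centroids = kmeans(y_pool, k=3)

# One paired example per cluster is the only supervision used.
bridge = fit_bridge(x_centroids, y_centroids, x_centers, y_centers)

y_hat = predict(np.array([[9.8, 0.3], [0.2, 9.9]]), x_centroids, y_centroids, bridge)
print(bridge, y_hat.ravel())
```

In this toy setup the two queries sit near the second and third input clusters, so the predictions should land close to the output centroids near 10 and 20. The bridge is a single integer per input cluster, which is what makes the alignment sparse and easy to inspect; inference costs one nearest-centroid lookup per query.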

Paper Structure

This paper contains 44 sections, 5 equations, 11 figures, 2 tables, 2 algorithms.

Figures (11)

  • Figure 1: MSE distribution of different models in the inductive setting. The four distribution plots of the same color represent the settings with 1, 2, 3, and 4 supervised samples per cluster.
  • Figure 2: Transductive Experiment: Best models in terms of lowest MSE, computed across 30 randomized trials per setting. Each bar represents the 30 trials of one setting. For example, if Bridged Clustering achieves the lowest MSE among all models in 15 out of the 30 trials for some setting, the bar that corresponds to that setting will be colored 50% blue. The 1, 2, 3, and 4 ticks on the bottom represent the settings with 1, 2, 3, and 4 supervised samples per cluster.
  • Figure 3: Inductive Experiment: Best models in terms of lowest MSE, computed across 30 randomized trials per setting. Each bar represents the 30 trials of one setting. For example, if Bridged Clustering achieves the lowest MSE among all models in 15 out of the 30 trials for some setting, the bar that corresponds to that setting will be colored 50% blue. The 1, 2, 3, and 4 ticks on the bottom represent the settings with 1, 2, 3, and 4 supervised samples per cluster.
  • Figure 4: MSE distribution of different models in the transductive setting. The four distribution plots of the same color represent the settings with 1, 2, 3, and 4 supervised samples per cluster.
  • Figure 5: MSE distribution of different models in reversed experiments in the transductive setting. The four distribution plots of the same color represent the settings with 1, 2, 3, and 4 supervised samples per cluster.
  • ...and 6 more figures