In-Context Semi-Supervised Learning

Jiashuo Fan, Paul Rosu, Aaron T. Wang, Michael Li, Lawrence Carin, Xiang Cheng

TL;DR

The paper addresses the scarcity of labels in in-context learning by proposing in-context semi-supervised learning (IC-SSL), where a Transformer first learns a geometry-aware representation from unlabeled data and then performs in-context supervised inference with limited labels. It introduces a two-stage Transformer: a representation-learning stage that computes Laplacian-based eigenmaps and a second stage that implements gradient-descent-like in-context learning for categorical predictions, trained end-to-end but built with a mechanistic, interpretable bias. Across synthetic manifolds, product manifolds, and image manifolds (including ImageNet100), the approach yields strong low-label performance and robust out-of-distribution transfer, outperforming baselines that rely on offline Laplacian embeddings or plain ICL. The work demonstrates that Transformers can extract and exploit unlabeled geometric structure in-context, offering a principled, geometry-aware view of how attention and MLPs realize semi-supervised inference with limited supervision. These results provide a foundation for understanding and leveraging unlabeled context in scalable, cross-domain Transformer applications.

Abstract

There has been significant recent interest in understanding the capacity of Transformers for in-context learning (ICL), yet most theory focuses on supervised settings with explicitly labeled pairs. In practice, Transformers often perform well even when labels are sparse or absent, suggesting crucial structure within unlabeled contextual demonstrations. We introduce and study in-context semi-supervised learning (IC-SSL), where a small set of labeled examples is accompanied by many unlabeled points, and show that Transformers can leverage the unlabeled context to learn a robust, context-dependent representation. This representation enables accurate predictions and markedly improves performance in low-label regimes, offering foundational insights into how Transformers exploit unlabeled context for representation learning within the ICL framework.
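
The two-stage idea summarized above (first extract geometry from all points, then run supervised inference from the few labels) can be illustrated with a small offline sketch. This is not the paper's Transformer construction; it is closer in spirit to the offline Laplacian-embedding baselines mentioned in the TL;DR. The symmetric normalized Laplacian, the toy circle data, the bandwidth `sigma2=0.5`, and the helper names `laplacian_eigenmap` and `fit_logistic_gd` are all assumptions made for illustration.

```python
import numpy as np

def laplacian_eigenmap(X, sigma2, k):
    """Return k eigenmap coordinates per point (offline stand-in for stage A)."""
    n = len(X)
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dist / (2.0 * sigma2))      # Gaussian affinities
    deg = K.sum(axis=1)
    # Symmetric normalized Laplacian: an assumed, standard form,
    # not necessarily the paper's right-normalized definition.
    L = np.eye(n) - K / np.sqrt(np.outer(deg, deg))
    _, evecs = np.linalg.eigh(L)               # eigenvalues in ascending order
    # Skip the trivial first eigenvector; rescale so features are O(1).
    return np.sqrt(n) * evecs[:, 1:k + 1]

def fit_logistic_gd(Phi, y, steps=500, lr=1.0):
    """Binary logistic regression fitted by plain gradient descent (stand-in for stage C/D)."""
    w, b = np.zeros(Phi.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Phi @ w + b)))
        grad = p - y                           # gradient of the log-loss w.r.t. logits
        w -= lr * Phi.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

# Toy context: n points on a circle, labeled by half-circle (hypothetical data).
n = 200
theta = 2.0 * np.pi * (np.arange(n) + 0.5) / n
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)
y = (np.sin(theta) > 0).astype(float)
labeled = np.arange(0, n, 10)                  # 10% of the points carry labels
unlabeled = np.setdiff1d(np.arange(n), labeled)

Phi = laplacian_eigenmap(X, sigma2=0.5, k=2)   # uses all points, no labels
w, b = fit_logistic_gd(Phi[labeled], y[labeled])
pred = (Phi[unlabeled] @ w + b > 0).astype(float)
acc = float((pred == y[unlabeled]).mean())
print(f"accuracy on unlabeled points: {acc:.3f}")
```

The features are computed from every point in the context, while the classifier only ever sees the labeled subset. That split is the part of the sketch that mirrors the IC-SSL setup; the toy data is deliberately simple and says nothing about the paper's reported results.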

Paper Structure

This paper contains 64 sections, 3 theorems, 62 equations, 10 figures, 7 tables.

Key Result

Lemma 1

Let $\hat{\mathcal{L}}\in \mathbb{R}^{n\times n}$ denote the right-normalized Laplacian with bandwidth $\sigma^2$ as defined in e:laplacian_appendix above. Consider the single-head Transformer, parameterized as in e:t:qoimdmald:0, with the following parameters: $s_\ell = 0 \ldots$

Figures (10)

  • Figure 1: Setup for semi-supervised in-context learning with a Transformer. Input Data consists of labeled pairs $\{(x^{(1)},y^{(1)}),\dots,(x^{(m)},y^{(m)})\}$, where $y^{(i)}\in\{{\textcolor{blue}{0}}, {\textcolor{red}{1}}\}$, and unlabeled data $\{x^{(m+1)},\dots,x^{(n)}\}$. In stage (A), a Transformer module takes as input $\{x^{(1)},\dots,x^{(n)}\}$ and outputs $\phi(x^{(i)})$, where $\phi(\cdot)$ is a context-dependent feature representation. In stage (B), we augment $\{\phi(x^{(1)}),\dots,\phi(x^{(n)})\}$ with the $m$ observed labels. In stage (C), a second Transformer module takes this augmented set as input. In stage (D), the second Transformer module outputs prediction probabilities of $y^{(m+1)},\dots,y^{(n)}$ for each unlabeled $x^{(i)}$ in standard ICL fashion. The two Transformer modules combine into a single Transformer, trained end-to-end. As detailed in Section \ref{s:transformer_construction}, the first module is motivated by learning an eigenmap of the Laplacian, and the second module performs ICL with categorical observations by implementing gradient descent at inference time.
  • Figure 2: Synthetic manifolds we use. Left to right: Sphere $(\mathbb{S}^2)$, Right Circular Cylinder $(\text{C})$, Right Circular Cone $(\text{Cone}_\alpha)$, Archimedean Spiral ($\text{Swiss-Roll, SR}$), Flat Torus $(\mathbb{T}^2)$. The colors reflect the binary labels.
  • Figure 3: In-context learning on ImageNet100 with 3% labels; “dataset size” = number of constructed in-context tasks. Columns: ICL accuracy (left), separation, i.e. intra- minus inter-class similarity (middle), and mNN neighborhood-overlap similarity (right). Methods: Orig+E2E-ICL (red), Transformer baseline (blue), Orig+ICL (green; VGG-29 features).
  • Figure 4: Accuracy vs. labeled-sample ratio on manifold benchmarks. (a) In-distribution: train/test on cylinder. (b) OOD (cylinder): blue Orig+E2E-ICL and green Orig+ICL are trained on {sphere, cone, torus, Swiss-roll}, tested on cylinder; orange is the ID reference (train/test on cylinder). (c) In-distribution (product manifold): training and testing on the high-dimensional product manifold ($\mathbb{S}^2 \times \text{C} \times \text{Cone}_\alpha \times \text{SR} \times \mathbb{T}^2$).
  • Figure 5: Results on image manifolds. Left: zero-shot test accuracy, where blue Orig+E2E-ICL and green Eig+ICL are trained only on four synthetic manifolds {sphere, cone, Swiss-roll, cylinder} and tested on image manifolds; red/orange denote ID runs (trained and tested on images). Baselines Eig+LR/Orig+RBF-LR are omitted because they underperform by a large margin. Right: an example manifold obtained by keeping every 5th of the $n=100$ SLERP frames between two random latents; top two rows are label 1, bottom two rows are label 0, illustrating the smooth geodesic manifold structure.
  • ...and 5 more figures
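
The Figure 5 caption refers to SLERP frames between two random latents. As a minimal sketch of what such a frame sequence could look like, the block below applies the standard spherical linear interpolation formula and keeps every 5th of 100 frames. The latent dimension of 64, the Gaussian latents, and the fallback to linear interpolation for nearly parallel vectors are assumptions for illustration; the paper's generator and exact interpolation settings are not given in this summary.

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical linear interpolation between z0 and z1 at fraction t in [0, 1]."""
    u0 = z0 / np.linalg.norm(z0)
    u1 = z1 / np.linalg.norm(z1)
    # Angle between the two directions; clip guards against rounding error.
    omega = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    if np.isclose(np.sin(omega), 0.0):
        # Nearly parallel (or antiparallel) vectors: fall back to linear interpolation.
        return (1.0 - t) * z0 + t * z1
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

rng = np.random.default_rng(0)
latent_dim = 64                                # hypothetical latent size
z0 = rng.standard_normal(latent_dim)
z1 = rng.standard_normal(latent_dim)

# n = 100 frames along the path, then keep every 5th frame as in the caption.
frames = np.stack([slerp(z0, z1, t) for t in np.linspace(0.0, 1.0, 100)])
subsampled = frames[::5]
print(frames.shape, subsampled.shape)
```

In a full pipeline each latent frame would be decoded into an image by a generative model; that decoding step is outside the scope of this sketch.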

Theorems & Definitions (12)

  • Remark 1
  • Lemma 1: Construction of Transformer for Computing Laplacian
  • proof: Proof of Lemma \ref{l:laplacian_transformer_construction}
  • Remark 2
  • Lemma 2: Construction of Transformer for computing Eigenmap.
  • Remark 3
  • proof
  • Theorem 1: Geodesic distance on a product manifold
  • proof
  • Remark 4
  • ...and 2 more