When and How Unlabeled Data Provably Improve In-Context Learning

Yingcong Li, Xiangyu Chang, Muti Kara, Xiaofeng Liu, Amit Roy-Chowdhury, Samet Oymak

TL;DR

The loss landscape of one-layer linear attention models recovers the optimal fully-supervised estimator but completely fails to exploit unlabeled data. The paper proposes looping off-the-shelf tabular foundation models to enhance their semi-supervision capabilities.

Abstract

Recent research shows that in-context learning (ICL) can be effective even when demonstrations have missing or incorrect labels. To shed light on this capability, we examine a canonical setting where the demonstrations are drawn according to a binary Gaussian mixture model (GMM) and a certain fraction of the demonstrations have missing labels. We provide a comprehensive theoretical study to show that: (1) The loss landscape of one-layer linear attention models recovers the optimal fully-supervised estimator but completely fails to exploit unlabeled data; (2) In contrast, multilayer or looped transformers can effectively leverage unlabeled data by implicitly constructing estimators of the form $\sum_{i\ge 0} a_i (X^\top X)^iX^\top y$, with $X$ and $y$ denoting features and partially-observed labels (with missing entries set to zero). We characterize the class of polynomials that can be expressed as a function of depth and draw connections to Expectation Maximization, an iterative pseudo-labeling algorithm commonly used in semi-supervised learning. Importantly, the leading polynomial power is exponential in depth, so a mild amount of depth/looping suffices. As an application of the theory, we propose looping off-the-shelf tabular foundation models to enhance their semi-supervision capabilities. Extensive evaluations on real-world datasets show that our method significantly improves semi-supervised tabular learning performance over standard single-pass inference.
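To make the estimator form in the abstract concrete, the following is a minimal NumPy sketch of $\sum_{i\ge 0} a_i (X^\top X)^iX^\top y$ on synthetic binary-GMM data with missing labels set to zero. The data generator (unit-norm mean, isotropic noise, the `frac_labeled` parameter), the function names, and the coefficient values are illustrative assumptions, not the paper's exact setup or its optimal coefficients.

```python
import numpy as np

def sample_gmm(n, p, sigma, frac_labeled, rng):
    """Hypothetical binary GMM sampler: x_i = y_i * mu + sigma * noise.

    Labels of unlabeled demonstrations are set to 0, matching the
    'missing entries set to zero' convention in the abstract.
    """
    mu = rng.standard_normal(p)
    mu /= np.linalg.norm(mu)                      # assumed unit-norm mean
    y_true = rng.choice([-1.0, 1.0], size=n)      # balanced +/-1 labels
    X = y_true[:, None] * mu + sigma * rng.standard_normal((n, p))
    labeled = rng.random(n) < frac_labeled        # which labels are observed
    y_obs = np.where(labeled, y_true, 0.0)
    return X, y_obs, y_true, mu

def polynomial_estimator(X, y, coeffs):
    """Return sum_i coeffs[i] * (X^T X)^i X^T y for a finite coefficient list."""
    gram = X.T @ X
    term = X.T @ y                                # i = 0 term: X^T y
    est = np.zeros_like(term)
    for a in coeffs:
        est += a * term
        term = gram @ term                        # advance to (X^T X)^(i+1) X^T y
    return est

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y_obs, y_true, mu = sample_gmm(n=200, p=10, sigma=1.0,
                                      frac_labeled=0.1, rng=rng)
    n = X.shape[0]
    # Scale features by 1/sqrt(n) so powers of the Gram matrix stay bounded.
    Xs = X / np.sqrt(n)
    w_sup = polynomial_estimator(Xs, y_obs, [1.0])        # labeled data only
    w_poly = polynomial_estimator(Xs, y_obs, [0.0, 1.0])  # also uses X^T X
    for name, w in [("degree 0", w_sup), ("degree 1", w_poly)]:
        cos = w @ mu / np.linalg.norm(w)
        print(f"{name}: cosine alignment with mu = {cos:.3f}")
```

The degree-0 term $X^\top y$ only touches labeled rows, because unlabeled rows carry $y_i=0$. Higher powers multiply by $X^\top X$, which is built from all rows, and this is how unlabeled features enter the estimate.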

Paper Structure

This paper contains 23 sections, 9 theorems, 112 equations, 2 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Let the prompt (cf. def Z) be generated as described in Section sec icl. Consider the objective (cf. obj att) with $L=1$ and squared loss function $\ell(y,\hat{y})=(y-\hat{y})^2$, and denote the optimal prediction as $y_{\text{att-}1}^\star(\bm{Z})$. Let $\hat{\boldsymbol{\mu}}_s$ represent … Additionally, its classification error obeys … where we define $\varepsilon_\sigma=\sigma/\sqrt{np}$.

Figures (2)

  • Figure 1: Experimental results support our theoretical findings presented in Sections \ref{sec one layer} and \ref{sec multilayer}. In all three subfigures, blue, green, and orange markers represent the results of 1-, 2-, and 5-layer linear attention models, respectively. The SPI estimator (cf. \ref{pred spi}), SSPI-$1$, and SSPI-$\infty$ (cf. \ref{pred sspi}) are shown as blue solid, green solid, and green dotted curves, respectively. The red dotted curves in all subfigures correspond to the single-layer/SPI results described in Eq. \ref{one layer err} of Theorem \ref{thm one layer}, while the black dotted line in Fig. \ref{fig diff n 10000} corresponds to Eq. \ref{n infty err} of Theorem \ref{thm optimal A}. Additional details and discussion can be found in Sections \ref{sec one layer}, \ref{sec multilayer}, and \ref{sec exp}.
  • Figure 2: Additional experimental results. (a) and (b): Analysis of the optimal $\alpha$ values for the SSPI estimator (cf. \ref{pred sspi}) under varying $(n, p, k)$. Green solid and dotted curves represent optimal $\alpha$ values for SSPI-$1$ and SSPI-$\infty$, respectively. The SSPI results shown in Figure \ref{fig multi layer} use the corresponding $\alpha$ values from Figs. \ref{fig diff m alpha} and \ref{fig diff n 10000 alpha}. (c): Comparison of different model architectures for the SS-ICL problem. Dark blue and orange curves show results for 1-layer and 5-layer attention models, with solid and dashed curves representing linear and softmax attention, respectively. Cyan curves correspond to 5-layer Transformers. The black dotted curve shows the asymptotic Bayes-optimal error (cf. lelarge2019asymptotic). Results suggest the performance ordering: Transformer > linear attention > softmax attention. Further details are provided in Section \ref{sec exp}.

Theorems & Definitions (9)

  • Theorem 1
  • Proposition 1
  • Proposition 2
  • Lemma 1: Label $+$ Feature Propagation
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Lemma 2
  • Lemma 3