Task Shift: From Classification to Regression in Overparameterized Linear Models

Tyler LaBonte, Kuo-Wei Lai, Vidya Muthukumar

TL;DR

This work investigates whether estimators trained on classification can generalize to regression in overparameterized linear models with Gaussian covariates, in both the zero-shot and few-shot regimes. Using the minimum-norm interpolator (MNI) framework and a fine-grained, parameter-level analysis under anisotropic covariance, the authors show that zero-shot task shift is impossible for both sparse and random signals, even under benign overfitting. They then introduce a simple few-shot postprocessing method that first identifies the support of a $t$-sparse ground truth from attenuation patterns in the classification MNI and then runs a least-squares fit on that support using a small regression dataset, achieving a regression error of order $O\left(\frac{t}{m}\right)$ with $m$ regression samples. The results reveal a structured attenuation of classification signals that can be exploited for few-shot task shift and illuminate fundamental bias–task-shift tradeoffs in dense random-signal settings, with implications for understanding in-context learning and kernel/NTK regimes in high dimensions.
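The two-step postprocessing idea described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: it assumes isotropic standard Gaussian covariates, selects the support as the $t$ largest-magnitude coordinates of the classification MNI (the "top-$t$ support recovery" mentioned in the figure captions), and uses hypothetical function names (`classification_mni`, `top_t_support`, `postprocess`, `demo`) and hypothetical sizes for $n$, $d$, and $m$.

```python
import numpy as np


def classification_mni(X, y):
    """Minimum l2-norm interpolator of the labels: theta = X^T (X X^T)^{-1} y."""
    return X.T @ np.linalg.solve(X @ X.T, y)


def top_t_support(theta_hat, t):
    """Indices of the t largest-magnitude coordinates, in increasing order."""
    return np.sort(np.argsort(np.abs(theta_hat))[-t:])


def postprocess(theta_hat, X_reg, y_reg, t):
    """Least-squares refit on the estimated support using m regression samples."""
    support = top_t_support(theta_hat, t)
    coef, *_ = np.linalg.lstsq(X_reg[:, support], y_reg, rcond=None)
    theta_post = np.zeros_like(theta_hat)
    theta_post[support] = coef
    return theta_post, support


def demo(n=800, d=4000, m=50, t=2, seed=0):
    """Toy run with assumed (hypothetical) sizes and isotropic covariates."""
    rng = np.random.default_rng(seed)

    # 2-sparse ground truth, mirroring the a_1 = 1, a_2 = -0.5 example.
    theta_star = np.zeros(d)
    theta_star[0], theta_star[1] = 1.0, -0.5

    # Classification training data: only the signs of the responses are seen.
    X = rng.standard_normal((n, d))
    y_cls = np.sign(X @ theta_star)
    theta_hat = classification_mni(X, y_cls)

    # Few-shot regression data with standard Gaussian label noise.
    X_reg = rng.standard_normal((m, d))
    y_reg = X_reg @ theta_star + rng.standard_normal(m)
    theta_post, support = postprocess(theta_hat, X_reg, y_reg, t)

    # With identity covariance the excess regression risk is the squared
    # l2 distance between the estimate and the true signal.
    risk_cls = float(np.sum((theta_hat - theta_star) ** 2))
    risk_post = float(np.sum((theta_post - theta_star) ** 2))
    return support, risk_cls, risk_post


if __name__ == "__main__":
    support, risk_cls, risk_post = demo()
    print("estimated support:", support)
    print("risk of classification MNI:", risk_cls)
    print("risk after postprocessing:", risk_post)
```

In this toy setup the classification MNI keeps the direction of the support coordinates but shrinks their magnitude, so its regression risk stays large; the refit on the recovered support only has to estimate $t$ coefficients from $m$ samples.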

Abstract

Modern machine learning methods have recently demonstrated remarkable capability to generalize under task shift, where latent knowledge is transferred to a different, often more difficult, task under a similar data distribution. We investigate this phenomenon in an overparameterized linear regression setting where the task shifts from classification during training to regression during evaluation. In the zero-shot case, wherein no regression data is available, we prove that task shift is impossible in both sparse signal and random signal models for any Gaussian covariate distribution. In the few-shot case, wherein limited regression data is available, we propose a simple postprocessing algorithm which asymptotically recovers the ground-truth predictor. Our analysis leverages a fine-grained characterization of individual parameters arising from minimum-norm interpolation which may be of independent interest. Our results show that while minimum-norm interpolators for classification cannot transfer to regression a priori, they experience surprisingly structured attenuation which enables successful task shift with limited additional data.

Paper Structure

This paper contains 47 sections, 30 theorems, 233 equations, 6 figures, 2 algorithms.

Key Result

Lemma 7

Define $\bar{{\mathcal{S}}} \coloneqq {\mathcal{S}} \cup [k^\star]$ and denote by $\{\tilde{\lambda}_j\}_{j=1}^{d - |\bar{{\mathcal{S}}}|}$ the diagonal entries of the matrix $\bm{\Sigma}_{-\bar{{\mathcal{S}}}}$, i.e., $\bm{\Sigma}$ with rows and columns indexed by $\bar{{\mathcal{S}}}$ left out, and with probability at least $1-cte^{-n^{2\epsilon}}$. When $n,d\to\infty$, the limit converges as al

Figures (6)

  • Figure 1: Task shift in language modeling and statistical estimation. In our task shift setting, latent knowledge is transferred between tasks under a similar conditional distribution or ground-truth signal. Task shift is compelling when the aim is to shift to a fundamentally harder task, with little to no data available from the new task.
  • Figure 2: Postprocessing achieves task shift even when minimum $\ell_2$-norm interpolation fails for both classification and regression. The left column demonstrates the survival of $t$-sparse signal support components in the classification MNI $\hat{{\bm{\theta}}}$ while non-support components decay quickly. The middle column shows the ${\mathcal{O}}\left(\frac{t}{m}\right)$ regression error of least-squares with reduction to $t$ dimensions using $m$ regression samples under standard Gaussian noise. Finally, the right column displays the regression risk of the classification MNI, regression MNI, and our postprocessed predictor. The signal ${\bm{\theta}}^\star$ is $2$-sparse with $a_1=1$ and $a_2=-0.5$ (see Assumption \ref{asm:k_sparse}). The middle column fixes $n=2500$. We plot the mean and standard deviation over $10$ draws of the training dataset ${\mathbf{X}}$. See Appendix \ref{app:simulation} for additional simulations.
  • Figure 3: Task shift for spiked covariance with $q < 1-r$ and signal outside the spike. We set $p=1.5$, $q=0.5$, and $r=0.25$ so that $q < 1-r$. Moreover, we add an additional signal component which lies outside the covariance spike for any $n\leq 2500$. Our task shift algorithm correctly recovers the support and generalizes well; note that the decay of the component outside the spike (index 8) is faster than those in the spike (indices 1-2), but still slower than those outside the support (indices 3-7 and 9-10). The true signal ${\bm{\theta}}^\star$ is $3$-sparse with $a_1=1$, $a_2=-0.5$, and $a_8=-0.15$ (see Assumption \ref{asm:k_sparse}). Our postprocessing algorithm uses top-$t$ support recovery and least-squares on noisy $m$-shot regression data. We plot the mean and standard deviation over $10$ draws of the training dataset ${\mathbf{X}}$.
  • Figure 4: Task shift for polynomial covariance with $u=0.25,v=0$. Our task shift estimator generalizes for polynomial covariance models. The true signal ${\bm{\theta}}^\star$ is $2$-sparse with $a_1=0.2$ and $a_2=-0.1$ (see Assumption \ref{asm:k_sparse}), and we set $d=n^{1.5}$. Note that this parameterization satisfies the conditions of Corollary \ref{cor:poly_support_identification2}. Our postprocessing algorithm uses top-$t$ support recovery and least-squares on noisy $m$-shot regression data. We plot the mean and standard deviation over $10$ draws of the training dataset ${\mathbf{X}}$.
  • Figure 5: Task shift for isotropic covariance $\bm{\Sigma}=50{\bm{I}}$. Our task shift estimator generalizes even in worst-case scenarios for minimum $\ell_2$-norm interpolation such as isotropic covariance. The true signal ${\bm{\theta}}^\star$ is $2$-sparse with $a_1=1$ and $a_2=-0.5$ (see Assumption \ref{asm:k_sparse}), and we set $d=n^{1.5}$. Our postprocessing algorithm uses top-$t$ support recovery and least-squares on noisy $m$-shot regression data. We plot the mean and standard deviation over $10$ draws of the training dataset ${\mathbf{X}}$.
  • ...and 1 more figure

Theorems & Definitions (60)

  • Definition 1: Effective rank
  • Definition 3: Spiked covariance matrix
  • Definition 4: Polynomial decay covariance matrix
  • Definition 6: Survival and contamination
  • Lemma 7
  • Theorem 8
  • Corollary 9
  • Lemma 10
  • Lemma 11
  • Theorem 12
  • ...and 50 more