Sample-Efficient Linear Representation Learning from Non-IID Non-Isotropic Data

Thomas T. C. K. Zhang, Leonardo F. Toso, James Anderson, Nikolai Matni

TL;DR

This work introduces De-bias & Feature-Whiten (DFW), an adaptation of the popular alternating minimization-descent scheme, and establishes linear convergence to the optimal representation with a noise level that scales down with the total source data size. This leads to generalization bounds on the same order as an oracle empirical risk minimizer.

Abstract

A powerful concept behind much of the recent progress in machine learning is the extraction of common features across data from heterogeneous sources or tasks. Intuitively, using all of one's data to learn a common representation function benefits both computational effort and statistical generalization by leaving a smaller number of parameters to fine-tune on a given task. Toward theoretically grounding these merits, we propose a general setting of recovering linear operators $M$ from noisy vector measurements $y = Mx + w$, where the covariates $x$ may be both non-i.i.d. and non-isotropic. We demonstrate that existing isotropy-agnostic representation learning approaches incur biases on the representation update, which causes the scaling of the noise terms to lose favorable dependence on the number of source tasks. This in turn can cause the sample complexity of representation learning to be bottlenecked by the single-task data size. We introduce an adaptation, $\texttt{De-bias & Feature-Whiten}$ ($\texttt{DFW}$), of the popular alternating minimization-descent scheme proposed independently in Collins et al. (2021) and Nayer and Vaswani (2022), and establish linear convergence to the optimal representation with noise level scaling down with the $\textit{total}$ source data size. This leads to generalization bounds on the same order as an oracle empirical risk minimizer. We verify the vital importance of $\texttt{DFW}$ in various numerical simulations. In particular, we show that vanilla alternating minimization-descent fails catastrophically even for i.i.d. but mildly non-isotropic data. Our analysis unifies and generalizes prior work, and provides a flexible framework for a wider range of applications, such as in controls and dynamical systems.
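To make the setting concrete, the following is a minimal NumPy sketch of one alternating minimization-descent step in the spirit of the abstract, under stated assumptions rather than the authors' exact algorithm. It assumes each task's operator factors as $M = F\Phi$ with a shared representation $\Phi$ having orthonormal rows. It reads "de-bias" as computing the task weights and the representation gradient on disjoint halves of each task's data, and "feature-whiten" as right-multiplying the gradient by the inverse sample covariance of the covariates. The function names, step size, problem dimensions, and the i.i.d. Gaussian simulation are all hypothetical choices made for illustration.

```python
import numpy as np


def subspace_distance(phi_hat, phi_star):
    """Spectral norm of phi_hat restricted to the orthogonal complement of rowspace(phi_star)."""
    d_x = phi_star.shape[1]
    proj_perp = np.eye(d_x) - phi_star.T @ phi_star
    return np.linalg.norm(phi_hat @ proj_perp, 2)


def whitened_split_step(phi_hat, tasks, step_size=0.25):
    """One illustrative step: per-task least squares, then a whitened descent step.

    phi_hat: (r, d_x) array with orthonormal rows (current representation).
    tasks:   list of (X, Y) pairs, X of shape (n, d_x), Y of shape (n, d_y).
    """
    update = np.zeros_like(phi_hat)
    for X, Y in tasks:
        half = X.shape[0] // 2
        X1, Y1, X2, Y2 = X[:half], Y[:half], X[half:], Y[half:]
        # Assumed "de-bias" step: fit task weights F on the first half only.
        Z1 = X1 @ phi_hat.T                                  # (half, r)
        F = np.linalg.lstsq(Z1, Y1, rcond=None)[0].T         # (d_y, r)
        # Gradient of the squared loss w.r.t. the representation on the second half.
        resid = X2 @ phi_hat.T @ F.T - Y2                    # (n2, d_y)
        grad = F.T @ resid.T @ X2 / X2.shape[0]              # (r, d_x)
        # Assumed "feature-whiten" step: right-multiply by the inverse sample covariance.
        cov = X2.T @ X2 / X2.shape[0]                        # (d_x, d_x), symmetric
        update += np.linalg.solve(cov, grad.T).T             # equals grad @ inv(cov)
    phi_bar = phi_hat - step_size * update / len(tasks)
    # Re-orthonormalize the rows with a thin QR factorization.
    q, _ = np.linalg.qr(phi_bar.T)                           # (d_x, r)
    return q.T


def simulate(seed=0, d_x=10, d_y=3, r=2, num_tasks=20, n=100, noise_std=0.1, iters=50):
    """Hypothetical simulation with i.i.d. but non-isotropic Gaussian covariates."""
    rng = np.random.default_rng(seed)
    phi_star = np.linalg.qr(rng.standard_normal((d_x, r)))[0].T
    phi_hat = np.linalg.qr(rng.standard_normal((d_x, r)))[0].T
    scales = np.sqrt(np.linspace(1.0, 3.0, d_x))             # covariance diag(1, ..., 3)
    f_stars = [rng.standard_normal((d_y, r)) / np.sqrt(d_y) for _ in range(num_tasks)]
    initial = subspace_distance(phi_hat, phi_star)
    for _ in range(iters):
        tasks = []
        for f_star in f_stars:
            X = rng.standard_normal((n, d_x)) * scales
            W = noise_std * rng.standard_normal((n, d_y))
            Y = X @ (f_star @ phi_star).T + W                # y = M x + w with M = F Phi
            tasks.append((X, Y))
        phi_hat = whitened_split_step(phi_hat, tasks)
    return initial, subspace_distance(phi_hat, phi_star)


initial, final = simulate()
print(f"subspace distance: initial={initial:.3f}, final={final:.3f}")
```

Solving against the sample covariance rather than forming its inverse explicitly is a numerical-stability choice. Dropping the `np.linalg.solve` line and using `grad` directly gives the isotropy-agnostic variant that the abstract argues is biased for non-isotropic covariates.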

Paper Structure

This paper contains 27 sections, 28 theorems, 134 equations, 4 figures, and 1 algorithm.

Key Result

Theorem 1.1

Let $\hat{\Phi}$ be the current estimate of the representation, and $\Phi_\star$ the optimal representation. Running one iteration of DFW yields the following improvement
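
In schematic form, a one-step guarantee of the kind the abstract describes (linear contraction plus a noise term that shrinks with the total source data size) would read as below. The symbols $\hat{\Phi}_+$ (the updated representation), $\rho$, $C$, $T$ (number of source tasks), and $N$ (samples per task) are notation assumed here for illustration, and problem-dependent factors are omitted.

$$\mathrm{dist}(\hat{\Phi}_+, \Phi_\star) \le \rho \cdot \mathrm{dist}(\hat{\Phi}, \Phi_\star) + \frac{C}{\sqrt{T N}}, \qquad \rho \in (0, 1).$$

Iterating such a bound gives geometric decay of the first term, leaving a noise floor on the order of $C / ((1 - \rho)\sqrt{T N})$.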

Figures (4)

  • Figure 1: We plot the suboptimality of the current representation relative to the ground truth representation with respect to the number of iterations, comparing the single-task and multi-task settings of DFW (Algorithm 1) and multi-task alternating minimization-descent. We observe performance improvement and variance reduction for multi-task DFW as predicted. All curves are plotted as the mean with 95% confidence regions shaded.
  • Figure 2: We plot the subspace distance between the current and ground truth representation with respect to the number of iterations, comparing the single-task and multi-task settings of DFW (Algorithm 1) and multi-task FedRep for IID linear regression with random covariance. We observe performance improvement and variance reduction for multi-task DFW as predicted.
  • Figure 3: We plot the subspace distance between the current and ground truth representation with respect to the number of iterations, comparing the single-task and multi-task settings of DFW (Algorithm 1) and multi-task FedRep for linear system identification with random covariance. We observe performance improvement and variance reduction for multi-task DFW as predicted.
  • Figure 4: We plot the subspace distance between the current and ground truth representation with respect to the number of iterations, comparing the single-task and multi-task settings of DFW (Algorithm 1) and multi-task FedRep for imitation learning with random covariance. We observe performance improvement and variance reduction for multi-task DFW as predicted.

Theorems & Definitions (36)

  • Theorem 1.1: main result, informal
  • Definition 2.1: Subspace Distance (Stewart and Sun, 1990; Collins et al., 2021)
  • Remark 3.1: Choice of weights $\hat{F}^{(t)}$ vs. descent rate
  • Definition 3.1: $\beta$-mixing
  • Definition 3.2: Task diversity
  • Lemma 3.1: Contraction factor bound
  • Proposition 3.1: Noise term bound
  • Lemma 3.2
  • Theorem 3.1: Main result
  • Remark 3.2: Initialization
  • ...and 26 more