Sample-Efficient Linear Representation Learning from Non-IID Non-Isotropic Data

Thomas T. C. K. Zhang; Leonardo F. Toso; James Anderson; Nikolai Matni

TL;DR

This work introduces De-bias & Feature-Whiten (DFW), an adaptation of the popular alternating minimization-descent scheme, and establishes linear convergence to the optimal representation with a noise level that scales down with the total source data size. This yields generalization bounds of the same order as an oracle empirical risk minimizer.

Abstract

A powerful concept behind much of the recent progress in machine learning is the extraction of common features across data from heterogeneous sources or tasks. Intuitively, using all of one's data to learn a common representation function benefits both computational effort and statistical generalization by leaving a smaller number of parameters to fine-tune on a given task. Toward theoretically grounding these merits, we propose a general setting of recovering linear operators $M$ from noisy vector measurements $y = Mx + w$, where the covariates $x$ may be both non-i.i.d. and non-isotropic. We demonstrate that existing isotropy-agnostic representation learning approaches incur biases on the representation update, which cause the scaling of the noise terms to lose favorable dependence on the number of source tasks. This in turn can cause the sample complexity of representation learning to be bottlenecked by the single-task data size. We introduce an adaptation, $\texttt{De-bias & Feature-Whiten}$ ($\texttt{DFW}$), of the popular alternating minimization-descent scheme proposed independently in Collins et al. (2021) and Nayer and Vaswani (2022), and establish linear convergence to the optimal representation with noise level scaling down with the $\textit{total}$ source data size. This leads to generalization bounds on the same order as an oracle empirical risk minimizer. We verify the vital importance of $\texttt{DFW}$ in various numerical simulations. In particular, we show that vanilla alternating minimization-descent fails catastrophically even for i.i.d. but mildly non-isotropic data. Our analysis unifies and generalizes prior work, and provides a flexible framework for a wider range of applications, such as controls and dynamical systems.
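The setting in the abstract can be illustrated with a small NumPy sketch. This is not the authors' Algorithm 1; it is a minimal illustration under stated assumptions: each task's operator is assumed to factor as $M = F\Phi$ with a shared row-orthonormal $\Phi$, covariates are drawn i.i.d. Gaussian with a non-identity covariance, "de-biasing" is read as fitting the task weights and the representation gradient on disjoint halves of each task's data, and "feature whitening" is read as right-multiplying the gradient by the inverse sample covariance. All function names, dimensions, and the step size are hypothetical.

```python
import numpy as np


def orthonormal_rows(A):
    """Return a matrix with orthonormal rows spanning the row space of A."""
    Q, _ = np.linalg.qr(A.T)  # reduced QR: Q has orthonormal columns
    return Q.T


def make_task_data(F, Phi, Sigma_sqrt, n, noise_std, rng):
    """Draw n samples of y = M x + w, with M = F @ Phi (assumed factorization)."""
    d_x, d_y = Phi.shape[1], F.shape[0]
    # Non-isotropic covariates: covariance is Sigma_sqrt @ Sigma_sqrt.T.
    X = rng.standard_normal((n, d_x)) @ Sigma_sqrt.T
    W = noise_std * rng.standard_normal((n, d_y))
    Y = X @ (F @ Phi).T + W
    return X, Y


def dfw_style_step(Phi_hat, tasks, step_size):
    """One illustrative de-biased, feature-whitened update of the representation."""
    updates = []
    for X, Y in tasks:
        half = X.shape[0] // 2
        X1, Y1, X2, Y2 = X[:half], Y[:half], X[half:], Y[half:]
        # Task weights: least squares on features z = Phi_hat x (first half only).
        Z1 = X1 @ Phi_hat.T
        F_hat = np.linalg.lstsq(Z1, Y1, rcond=None)[0].T  # shape (d_y, r)
        # Gradient of the averaged squared loss w.r.t. Phi (second half only).
        residual = X2 @ Phi_hat.T @ F_hat.T - Y2          # shape (n2, d_y)
        grad = F_hat.T @ residual.T @ X2 / X2.shape[0]    # shape (r, d_x)
        # Whitening: grad @ inv(Sigma_hat), computed with a linear solve.
        Sigma_hat = X2.T @ X2 / X2.shape[0]
        whitened = np.linalg.solve(Sigma_hat, grad.T).T
        updates.append(Phi_hat - step_size * whitened)
    # Average the per-task updates, then re-orthonormalize the rows.
    return orthonormal_rows(np.mean(updates, axis=0))


# Hypothetical problem sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
d_x, d_y, r, num_tasks, n = 8, 3, 2, 5, 200
Phi_star = orthonormal_rows(rng.standard_normal((r, d_x)))
Sigma_sqrt = np.diag(np.linspace(0.5, 2.0, d_x))
tasks = [
    make_task_data(rng.standard_normal((d_y, r)), Phi_star, Sigma_sqrt, n, 0.1, rng)
    for _ in range(num_tasks)
]
Phi_hat = orthonormal_rows(rng.standard_normal((r, d_x)))
Phi_next = dfw_style_step(Phi_hat, tasks, step_size=0.1)
```

Dropping the `np.linalg.solve` line (using `grad` directly) and fitting `F_hat` on the same samples as the gradient gives the vanilla-style update that the abstract contrasts against. Calling `dfw_style_step` repeatedly on its own output iterates the scheme.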

Paper Structure (27 sections, 28 theorems, 134 equations, 4 figures, 1 algorithm)

Introduction
Contributions:
Related Work
Problem Formulation
Regression Model.
Multi-Task Operator Recovery.
Sample-Efficient Linear Representation Learning
Perils of (Vanilla) Gradient Descent on the Representation
A Task-Efficient Algorithm: De-bias & Feature-whiten
Algorithm Guarantees
Numerical Validation
Linear Regression with IID and Non-isotropic Data
Linear System Identification
Discussion and Future Work
Theoretical Analysis of DFW (Algorithm 1)
...and 12 more sections

Key Result

Theorem 1.1

Let $\hat{\Phi}$ be the current estimate of the representation, and $\Phi_\star$ the optimal representation. Running one iteration of DFW contracts the distance to $\Phi_\star$ at a linear rate, up to a noise term that scales down with the total source data size.
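A schematic reading of this statement, consistent with the abstract but not the theorem's exact form, is sketched below. The symbols $\hat{\Phi}_{+}$ (the estimate after one iteration), $\rho$ (contraction factor), $T$ (number of tasks), $N$ (samples per task), and $\varepsilon$ (a noise term decreasing in the total sample count $TN$) are assumed notation; the paper's constants and rates are not reproduced.

```latex
% One-step improvement (schematic, assumed notation):
\[
  \mathrm{dist}(\hat{\Phi}_{+}, \Phi_\star)
  \;\le\; \rho \cdot \mathrm{dist}(\hat{\Phi}, \Phi_\star) + \varepsilon(TN),
  \qquad \rho \in (0, 1).
\]
% Unrolling over k iterations from an initial estimate \hat{\Phi}^{(0)}
% and summing the geometric series:
\[
  \mathrm{dist}(\hat{\Phi}^{(k)}, \Phi_\star)
  \;\le\; \rho^{k}\, \mathrm{dist}(\hat{\Phi}^{(0)}, \Phi_\star)
  + \frac{\varepsilon(TN)}{1 - \rho}.
\]
```

Under this reading, the first term decays geometrically in $k$ (linear convergence), while the second term sets an error floor that shrinks as the total amount of source data grows.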

Figures (4)

Figure 1: We plot the suboptimality of the current representation relative to the ground truth representation with respect to the number of iterations, comparing between the single and multiple-task settings of Algorithm 1 and the multi-task alternating minimization-descent. We observe performance improvement and variance reduction for multi-task DFW as predicted. All curves are plotted as the mean with 95% confidence regions shaded.
Figure 2: We plot the subspace distance between the current and ground truth representation with respect to the number of iterations, comparing between the single and multiple-task settings of Algorithm 1 and the multi-task FedRep for the IID linear regression with random covariance. We observe performance improvement and variance reduction for multi-task DFW as predicted.
Figure 3: We plot the subspace distance between the current and ground truth representation with respect to the number of iterations, comparing between the single and multiple-task settings of Algorithm 1 and the multi-task FedRep for the linear system identification with random covariance. We observe performance improvement and variance reduction for multi-task DFW as predicted.
Figure 4: We plot the subspace distance between the current and ground truth representation with respect to the number of iterations, comparing between the single and multiple-task settings of Algorithm 1 and the multi-task FedRep for the imitation learning with random covariance. We observe performance improvement and variance reduction for multi-task DFW as predicted.

Theorems & Definitions (36)

Theorem 1.1: main result, informal
Definition 2.1: Subspace Distance (Stewart and Sun, 1990; Collins et al., 2021)
Remark 3.1: Choice of weights $\hat{F}^{(t)}$ vs. descent rate
Definition 3.1: $\beta$-mixing
Definition 3.2: Task diversity
Lemma 3.1: Contraction factor bound
Proposition 3.1: Noise term bound
Lemma 3.2
Theorem 3.1: Main result
Remark 3.2: Initialization
...and 26 more
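
The subspace distance named in Definition 2.1 is the metric plotted in the figures. The definition's body is not reproduced on this page, so the sketch below assumes the projection form commonly used in this line of work: for row-orthonormal $\Phi$ and $\Phi_\star$, the spectral norm of $\Phi$ projected onto the orthogonal complement of the row space of $\Phi_\star$. The function name is hypothetical.

```python
import numpy as np


def subspace_distance(Phi, Phi_star):
    """Spectral norm of Phi projected onto the complement of rowspace(Phi_star).

    Assumes both inputs have orthonormal rows and shape (r, d_x).
    """
    d_x = Phi_star.shape[1]
    # Projector onto the orthogonal complement of the row space of Phi_star.
    proj_perp = np.eye(d_x) - Phi_star.T @ Phi_star
    return np.linalg.norm(Phi @ proj_perp, 2)


# Small checks on coordinate subspaces of R^4.
Phi_star = np.eye(4)[:2]                         # span of e1, e2
rotation = np.array([[0.0, 1.0], [-1.0, 0.0]])   # change of basis within the span
dist_same = subspace_distance(rotation @ Phi_star, Phi_star)
dist_orthogonal = subspace_distance(np.eye(4)[2:], Phi_star)
```

Under this assumed form, the distance depends only on the row spaces, so it is unchanged by rotating the basis within a subspace, and it ranges from 0 (identical subspaces) to 1 (some direction of one subspace is orthogonal to the other).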
