Bridging Lifelong and Multi-Task Representation Learning via Algorithm and Complexity Measure

Zhi Wang, Chicheng Zhang, Ramya Korlakai Vinayak

TL;DR

The paper addresses lifelong representation learning where tasks arrive sequentially and share a common representation. It introduces a simple, practical algorithm that alternates few-shot property tests (to reuse the current representation) with memory-backed multi-task ERM (to refine the representation when needed), and defines the task-eluder dimension to bound how often refinement is necessary. The main theoretical contribution is a finite-time bound showing that with finite capacities and a finite $\dim(\mathcal{H},\mathcal{F},\epsilon)$, the algorithm achieves $\epsilon$-excess risk for all $T$ tasks, with representation updates bounded by $O(\dim(\mathcal{H},\mathcal{F},\epsilon))$ and explicit sample/memory complexity expressions. The framework unifies lifelong learning with MTL/LTL by treating multi-task ERM as a mechanism to refine representations online, and it demonstrates practicality through synthetic and semi-synthetic experiments across regression and classification with both linear and deep representations. Overall, the work provides a principled, general theory for online transfer of representations under noise, with concrete guidance for implementing lifelong learning systems that scale to modern feature extractors and datasets.
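The test-then-refine loop described above can be illustrated with a minimal sketch for noisy linear regression with a shared low-dimensional linear representation. Everything concrete here is an assumption made for illustration and is not taken from the paper: the held-out squared-loss test and its `threshold`, the sample sizes `n_few` and `n_full`, the policy of storing only tasks that fail the test, and in particular `multitask_refit`, which stands in for multi-task ERM by taking the top singular vectors of per-task least-squares solutions.

```python
import numpy as np

def fit_head(B, X, y):
    """Least-squares prediction layer on top of the representation x -> B.T @ x."""
    w, *_ = np.linalg.lstsq(X @ B, y, rcond=None)
    return w

def passes_test(B, X, y, threshold):
    """Hypothetical few-shot test: fit a head on one half, check loss on the other."""
    m = len(y) // 2
    w = fit_head(B, X[:m], y[:m])
    held_out_loss = np.mean((X[m:] @ B @ w - y[m:]) ** 2)
    return held_out_loss <= threshold

def multitask_refit(memory, k):
    """Stand-in for multi-task ERM (an assumption, not the paper's procedure):
    top-k left singular vectors of the per-task ordinary least-squares solutions."""
    thetas = [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in memory]
    U, _, _ = np.linalg.svd(np.column_stack(thetas), full_matrices=False)
    return U[:, :k]

def run_lifelong(tasks, B_init, k, n_few, n_full, threshold):
    """Reuse the current representation when the test passes; otherwise refine it."""
    B, memory, n_updates = B_init, [], 0
    for sample in tasks:
        X, y = sample(n_few)                    # few-shot samples for the test
        if not passes_test(B, X, y, threshold):
            memory.append(sample(n_full))       # store more data from this task
            B = multitask_refit(memory, k)      # refine using all stored tasks
            n_updates += 1
    return B, n_updates

def make_task(B_true, w, noise, rng):
    """Synthetic task: y = x^T B_true w + Gaussian noise."""
    def sample(n):
        X = rng.standard_normal((n, B_true.shape[0]))
        y = X @ B_true @ w + noise * rng.standard_normal(n)
        return X, y
    return sample

rng = np.random.default_rng(0)
d, k, T = 20, 3, 30
B_true, _ = np.linalg.qr(rng.standard_normal((d, k)))  # shared ground-truth subspace
tasks = []
for _ in range(T):
    w = rng.standard_normal(k)
    tasks.append(make_task(B_true, w / np.linalg.norm(w), noise=0.1, rng=rng))

B_hat, n_updates = run_lifelong(
    tasks, np.eye(d)[:, :k], k, n_few=40, n_full=200, threshold=0.1
)
print(f"representation updates: {n_updates} over {T} tasks")
```

In this toy setting the representation is typically refined only a handful of times, roughly once per missing direction of the shared subspace, after which new tasks pass the few-shot test and reuse it. That mirrors, informally, the role the task-eluder dimension plays in bounding the number of updates.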

Abstract

In lifelong learning, a learner faces a sequence of tasks with shared structure and aims to identify and leverage it to accelerate learning. We study the setting where such structure is captured by a common representation of data. Unlike multi-task learning or learning-to-learn, where tasks are available upfront to learn the representation, lifelong learning requires the learner to make use of its existing knowledge while continually gathering partial information in an online fashion. In this paper, we consider a generalized framework of lifelong representation learning. We propose a simple algorithm that uses multi-task empirical risk minimization as a subroutine and establish a sample complexity bound based on a new notion we introduce--the task-eluder dimension. Our result applies to a wide range of learning problems involving general function classes. As concrete examples, we instantiate our result on classification and regression tasks under noise.

Paper Structure

This paper contains 56 sections, 18 theorems, 87 equations, 4 figures, 1 table, 2 algorithms.

Key Result

Proposition 3.3

Suppose we observe $n$ examples $\left\{(x_i, y_i)\right\}_{i=1}^n \sim P^n$, where $P$ denotes some noisy linear regression model in $\mathbb{R}^d$. Let $\mathcal{G}$ be a class of linear predictors in $\mathbb{R}^d$ and $\mathcal{G}_0 \subset \mathcal{G}$ be restricted to a fixed subspace of dimen[...] There exists some constant $c$ such that no test can successfully distinguish between $H_0$ and [...]

Figures (4)

  • Figure 1.1: Our algorithm maintains a representation $\hat{h}$. When a new task arrives, the algorithm first performs a few-shot property test to check whether $\hat{h}$ admits a prediction layer with low excess risk. If not, it performs MTL on data from a subset of previously seen tasks and updates $\hat{h}$.
  • Figure 7.1: Results under different noise levels. For each noise level, the top plot shows the average cumulative number of samples used over $50$ tasks for each value of $k$, and the bottom plot shows how the cumulative number of representation updates evolves over the tasks. Shaded regions denote one standard deviation.
  • Figure 7.2: Performance on semi-synthetic experiments with MNIST digits. (a) The solid curve shows, on average, how the number of representation updates increases over $50$ binary digit classification tasks, with the shaded area showing one standard deviation. The dashed line represents linear growth. (b) Each box plot shows the distribution of $0$-$1$ errors of the $50$ produced predictors when evaluated on held-out data from the MNIST test set in one of the $10$ independent trials.
  • Figure 7.3: Performance on semi-synthetic experiments with CIFAR-10 images. (a) The solid curve shows, on average, how the number of representation updates increases over $50$ image classification tasks, with the shaded area showing one standard deviation. The dashed line represents linear growth. (b) Each box plot shows the distribution of $0$-$1$ errors of the produced predictors when evaluated on held-out data from the CIFAR-10 test set in one of the $10$ independent trials.

Theorems & Definitions (26)

  • Remark 2.2: Comparison with prior work
  • Remark 3.2
  • Proposition 3.3: informal
  • Remark 3.4
  • Definition 3.5: $\epsilon$-independence
  • Definition 3.6: Task-eluder dimension
  • Proposition 3.7
  • Theorem 4.1: baxter2000model, Theorem 4 and Theorem 6 thereof
  • Corollary 4.2
  • Theorem 4.3
  • ...and 16 more