Adapting to Online Distribution Shifts in Deep Learning: A Black-Box Approach

Dheeraj Baby, Boran Han, Shuai Zhang, Cuixiong Hu, Yuyang Wang, Yu-Xiang Wang

TL;DR

This work tackles online distribution shifts by proposing AWE, a black-box meta-algorithm that wraps any online learner to improve performance under non-stationarity. The method combines Multi-Resolution Instances (MRI), which maintain a logarithmic pool of models, with Cross-Validation-Through-Time (CVTT), which refines accuracy estimates and weights models adaptively. The authors provide regret and generalization guarantees, including data-coverage properties ensuring that relevant recent data is represented, and demonstrate empirical gains on real-world, non-stationary text and image datasets. The approach enables adaptive attention to the most relevant historical data without convexity assumptions, offering practical advantages for deep-learning pipelines facing distribution shifts.

Abstract

We study the well-motivated problem of online distribution shift in which the data arrive in batches and the distribution of each batch can change arbitrarily over time. Since the shifts can be large or small, abrupt or gradual, the length of the relevant historical data to learn from may vary over time, which poses a major challenge in designing algorithms that can automatically adapt to the best ``attention span'' while remaining computationally efficient. We propose a meta-algorithm that takes any network architecture and any Online Learner (OL) algorithm as input and produces a new algorithm which provably enhances the performance of the given OL under non-stationarity. Our algorithm is efficient (it requires maintaining only $O(\log(T))$ OL instances) and adaptive (it automatically chooses OL instances with the ideal ``attention'' length at every timestamp). Experiments on various real-world datasets across text and image modalities show that our method consistently improves the accuracy of user-specified OL algorithms for classification tasks. Key novel algorithmic ingredients include a \emph{multi-resolution instance} design inspired by wavelet theory and a cross-validation-through-time technique. Both could be of independent interest.

Paper Structure

This paper contains 17 sections, 7 theorems, 25 equations, 19 figures, 4 tables, 3 algorithms.

Key Result

Theorem 1

Suppose we are at the beginning of a timestamp $t+1$ and the data distribution has remained constant from some round $t_0 < t+1$. Let this distribution be $\mathcal{D}$. We have labelled hold-out data available till round $t$. There exists at least one instance in the MRI pool that is active at a gi

Figures (19)

  • Figure 1: Configuration of Multi-Resolution Instances (MRI). Brackets of type $[\,]$ belong to the collection $R$ and brackets of type $\{\,\}$ belong to the collection $B$ (see Sec. \ref{subsec:mri}). Consider the scenario where the data distribution changed at timestamp $3$ and remained stable afterwards. Suppose we are at the beginning of round $9$ and after each round we get $n$ training data points, so we have seen $6n$ labelled data points from distribution $\mathcal{D}_2$. $\text{ACTIVE}(9)$ corresponds to those intervals that include the timestamp $9$. The circled intervals have seen at least $3n$ data points from distribution $\mathcal{D}_2$, which ensures that the active set contains models with good performance under $\mathcal{D}_2$. A formal result on the data utilization efficiency of the MRI construction is proved in Theorem \ref{thm:mri}.
  • Figure 2: $\%$ accuracy differences across various timestamps when AWE is run with SI as the online learning algorithm. We report similar results for other OL algorithms and the fraction of timestamps where AWE improves (or does not degrade) the performance of the base OL in Appendix \ref{app:exp}.
  • Figure 3: Ablation study across various resolutions. We compute the overall accuracy attained by AWE minus that attained by using only a single resolution in the MRI pool. We see that in most cases AWE outperforms the single resolution counterparts. Further, by virtue of using AWE, the user does not need to hand-tune the optimal resolution to use in an MRI pool.
  • Figure 4: $\%$ accuracy differences across various timestamps when AWE is run with FT as the online learning algorithm.
  • Figure 5: $\%$ accuracy differences across various timestamps when AWE is run with EWC as the online learning algorithm.
  • ...and 14 more figures
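
The idea in the Figure 1 caption, that only the intervals containing the current timestamp are active, can be illustrated with a small Python sketch. It is not the paper's MRI construction: the collections $R$ and $B$ are not specified in this summary, so the sketch assumes a single plain dyadic grid, and the function name `active_intervals` and the parameter `horizon` are hypothetical. It shows only why a multi-resolution pool keeps a logarithmic number of active instances per timestamp.

```python
def active_intervals(t: int, horizon: int) -> list[tuple[int, int]]:
    """Return the dyadic intervals (start, end) that contain timestamp t.

    Assumption (not from the paper): resolution k partitions the 1-indexed
    timeline into consecutive blocks of length 2**k, for
    k = 0, ..., floor(log2(horizon)). Exactly one block per resolution
    contains t, so the active set has floor(log2(horizon)) + 1 members.
    """
    if not 1 <= t <= horizon:
        raise ValueError("t must satisfy 1 <= t <= horizon")
    num_resolutions = horizon.bit_length()  # equals floor(log2(horizon)) + 1
    intervals = []
    for k in range(num_resolutions):
        length = 2 ** k
        # Start of the length-2**k block that contains t (1-indexed).
        start = ((t - 1) // length) * length + 1
        intervals.append((start, start + length - 1))
    return intervals


if __name__ == "__main__":
    # Five active intervals for t = 9 with a horizon of 16 rounds.
    for start, end in active_intervals(9, 16):
        print(f"[{start}, {end}]")
```

In this simplified grid, several of the intervals active at round 9 also start at round 9 and have therefore seen no earlier data. A single dyadic grid does not by itself provide the kind of data coverage described in the caption.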

Theorems & Definitions (17)

  • Theorem 1
  • Theorem 2
  • Remark 3
  • Remark 4
  • Theorem 4
  • proof
  • Definition 5
  • Theorem 5
  • proof
  • Proposition 6
  • ...and 7 more