Budgeted Online Continual Learning by Adaptive Layer Freezing and Frequency-based Sampling

Minhyuk Seo, Hyunseo Koh, Jonghyun Choi

TL;DR

This work tackles online continual learning under realistic resource constraints, arguing that fair evaluation requires measuring both the computation budget, as FLOPs per sample, and the memory budget, as total memory in bytes. It introduces two core techniques: adaptive layer freezing (aL), which maximizes the information gained per unit of computation using Fisher Information, and Similarity-Aware Retrieval (SAR), which prioritizes under-learned, informative samples using use-frequency and class-wise gradient similarity. Empirical results on CIFAR-10/100, CLEAR-10/100, and ImageNet-1K show that aL-SAR outperforms state-of-the-art methods within the same total budget while also reducing FLOPs, and that it extends to multi-modal large language models. The method offers a practical path for deploying online CL in real-world settings where both computation and memory are constrained, including large-scale multimodal fine-tuning.

Abstract

Most online continual learning (CL) work advocates single-epoch training and restricts the size of the replay memory. However, single-epoch training incurs a different amount of computation for each CL algorithm, and the additional cost of storing logits or models alongside the replay memory is largely ignored when calculating the storage budget. Arguing that differing computational and storage budgets hinder fair comparison among CL algorithms in practice, we propose to use floating point operations (FLOPs) and total memory size in bytes as metrics for the computational and memory budgets, respectively, so that CL algorithms are compared and developed under the same 'total resource budget.' To improve a CL method under a limited total budget, we propose adaptive layer freezing, which skips updating layers for less informative batches to reduce computational cost with a negligible loss of accuracy. In addition, we propose a memory retrieval method that allows the model to learn the same amount of knowledge as random retrieval in fewer iterations. Empirical validation on the CIFAR-10/100, CLEAR-10/100, and ImageNet-1K datasets demonstrates that the proposed approach outperforms state-of-the-art methods within the same total budget.
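To make the 'total memory in bytes' idea concrete, the following is a minimal sketch of how a storage budget might be tallied when stored logits and model copies are counted alongside replay samples. The function name, its parameters, and every number below are illustrative assumptions, not values or code from the paper.

```python
def total_memory_bytes(num_samples, bytes_per_sample,
                       num_logits=0, bytes_per_logit=4,
                       num_stored_models=0, num_params=0, bytes_per_param=4):
    """Sum every stored item, not only the replay samples (hypothetical helper)."""
    replay = num_samples * bytes_per_sample
    # Logits kept per replay sample, e.g. by distillation-style methods.
    logits = num_samples * num_logits * bytes_per_logit
    # Extra model copies kept in storage, e.g. a previous-task model.
    models = num_stored_models * num_params * bytes_per_param
    return replay + logits + models


# Assumed sizes: a 32x32 RGB uint8 image and float32 logits/parameters.
image_bytes = 32 * 32 * 3

replay_only = total_memory_bytes(1000, image_bytes)
with_logits = total_memory_bytes(1000, image_bytes, num_logits=10)
with_model = total_memory_bytes(1000, image_bytes,
                                num_stored_models=1, num_params=500_000)

print(replay_only, with_logits, with_model)
```

Under these assumed sizes, three methods with the same number of replay samples occupy different amounts of storage, which is the mismatch a byte-level budget is meant to expose.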

Paper Structure

This paper contains 59 sections, 9 equations, 21 figures, 22 tables.

Figures (21)

  • Figure 1: Comparison of CL methods without a total constraint (left) and with a total constraint (right) on the CIFAR-10 Gaussian setup. In the left plot, we compare CL methods with the same number of iterations and the same episodic memory size, i.e., the conventional setup. In the right plot, we compare CL methods with the same training FLOPs and a fixed storage budget that includes both episodic memory and model storage, i.e., our total-constrained CL setup. Compared to the conventional setup, aL-SAR shows improved performance under the total-constrained setup, since it can use the saved computational cost for further training. $A_\text{AUC}$ and $A_\text{LAST}$ refer to the area under the accuracy curve and the last accuracy (i.e., accuracy at the end of CL), respectively.
  • Figure 2: Gradient update procedure of the proposed aL-SAR. The colors in the 'Similarity-Aware Retrieval' box denote different classes. (1) 'Retrieval Probability' is calculated using class similarity $S$ and discounted use frequency $c_i$, where $c_i$ tracks the number of times the $i^\text{th}$ sample has been used for training. (2) A batch is sampled from memory by the 'Retrieval Probability' and (3) $c_i$ is updated by retrieval results. After the (4) forward pass of the model with the batch, (5) we compute the freezing criterion $(\text{BFC})_n$ for each layer $n$ of the model, using Fisher Information (FI) and $\frac{\partial \ell}{\partial x_L}$ (see the paper's section on layer freezing). (6) Layers 1 to $n_\text{max} = \arg\max_n (\text{BFC})_n$ (in this example, $n_\text{max} = 2$) are frozen in the (7) backward pass. (8) $S_{ij}$ and FI are updated using the gradient $\frac{\partial \ell}{\partial \theta}$ obtained from the backward pass.
  • Figure 3: Accuracy on the Gaussian and Disjoint CL setups in CIFAR-10 and CIFAR-100 for a wide range of FLOPs per sample. aL-SAR outperforms all compared CL methods. We use a memory budget of 30.4 MB.
  • Figure 4: Accuracy on the Gaussian and Disjoint CL setups in CIFAR-10 and CIFAR-100 for various memory budgets. aL-SAR outperforms all compared CL methods. We use a computational budget of 171.94 TFLOPs.
  • Figure 5: The estimated trace of Fisher Information for layers 8, 16, 24, and 32 of ResNet-32 on the CIFAR-10 Gaussian setup, comparing the estimate used in aL-SAR with an estimate from a sample size 16 times larger.
  • ...and 16 more figures
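
Step (6) of the Figure 2 caption, freezing layers 1 to $n_\text{max} = \arg\max_n (\text{BFC})_n$, can be sketched as below. This is only an illustration of the selection rule as the caption states it: the function name is hypothetical, the criterion values are made-up placeholders, and computing $(\text{BFC})_n$ from Fisher Information is not shown.

```python
def layers_to_freeze(bfc):
    """Return (n_max, frozen layers) for per-layer criterion values.

    bfc[k] holds the criterion for layer k + 1, since the caption
    numbers layers from 1. Ties resolve to the earliest layer.
    """
    n_max = max(range(len(bfc)), key=lambda k: bfc[k]) + 1
    return n_max, list(range(1, n_max + 1))


# Placeholder criterion values for a 4-layer model (not from the paper).
bfc_values = [0.10, 0.35, -0.20, -0.50]

n_max, frozen = layers_to_freeze(bfc_values)
# Layers above n_max still receive gradient updates in the backward pass.
trainable = [n for n in range(1, len(bfc_values) + 1) if n > n_max]

print(n_max, frozen, trainable)
```

With these placeholder values the rule picks $n_\text{max} = 2$, matching the example in the caption: layers 1 and 2 are skipped in the backward pass and layers 3 and 4 are updated.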