TinyTL: Reduce Activations, Not Trainable Parameters for Efficient On-Device Learning

Han Cai, Chuang Gan, Ligeng Zhu, Song Han

TL;DR

TinyTL tackles the memory bottleneck of on-device learning by freezing the pretrained feature-extractor weights and training only the biases, complemented by lite residual modules that refine intermediate features with minimal activation overhead. The approach targets training memory directly rather than parameter count, achieving up to 6.5x memory savings without feature-extractor adaptation and up to 12.9x with adaptation via a Once-for-All backbone. In experiments on 8 transfer tasks and facial-attribute benchmarks, TinyTL delivers these memory savings with accuracy comparable or superior to full fine-tuning, and it remains effective when trained with batch size 1. This makes practical, private, on-device learning feasible by sharply reducing activation memory and computation without sacrificing accuracy.

Abstract

On-device learning enables edge devices to continually adapt AI models to new data, which requires a small memory footprint to fit the tight memory constraints of edge devices. Existing work addresses this problem by reducing the number of trainable parameters. However, this does not directly translate into memory savings, since the major bottleneck is the activations, not the parameters. In this work, we present Tiny-Transfer-Learning (TinyTL) for memory-efficient on-device learning. TinyTL freezes the weights and learns only the bias modules, so there is no need to store the intermediate activations. To maintain the adaptation capacity, we introduce a new memory-efficient bias module, the lite residual module, which refines the feature extractor by learning small residual feature maps and adds only 3.8% memory overhead. Extensive experiments show that TinyTL saves significant memory (up to 6.5x) with little accuracy loss compared to fine-tuning the full network. Compared to fine-tuning only the last layer, TinyTL provides significant accuracy improvements (up to 34.1%) with little memory overhead. Furthermore, combined with feature extractor adaptation, TinyTL provides 7.3-12.9x memory savings without sacrificing accuracy compared to fine-tuning the full Inception-V3.
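
The claim that bias-only training removes the need to store intermediate activations can be checked on a single linear layer y = W x + b. The following is a minimal sketch in pure Python, not the authors' implementation; all function and variable names are illustrative assumptions. It shows that the weight gradient depends on the stored input x, while the bias gradient and the gradient passed to the previous layer do not.

```python
# Illustrative sketch (not the TinyTL code base): gradients of y = W x + b.
# grad_y is the gradient of the loss with respect to the layer output y.

def forward(W, b, x):
    """Compute y = W x + b for a weight matrix W (list of rows)."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def grad_weight(grad_y, x):
    """dL/dW[i][j] = grad_y[i] * x[j]: requires the stored input activation x."""
    return [[g * xj for xj in x] for g in grad_y]

def grad_bias(grad_y):
    """dL/db = grad_y: does not depend on x at all."""
    return list(grad_y)

def grad_input(W, grad_y):
    """dL/dx = W^T grad_y: requires the (frozen) weights, but not x."""
    return [sum(W[i][j] * grad_y[i] for i in range(len(W)))
            for j in range(len(W[0]))]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen weights (hypothetical values)
b = [0.5, -0.5]                # trainable bias
grad_y = [1.0, -1.0]           # upstream gradient (hypothetical values)

x_a = [1.0, 1.0]               # two different input activations
x_b = [10.0, -3.0]

# The bias gradient and the back-propagated gradient are identical for both
# inputs, so x never has to be kept in memory when only b is trained.
print(grad_bias(grad_y), grad_input(W, grad_y))
# The weight gradient differs between x_a and x_b, so it needs x stored.
print(grad_weight(grad_y, x_a) != grad_weight(grad_y, x_b))
```

The sketch covers only a linear layer; nonlinear layers are outside its scope.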

Paper Structure

This paper contains 28 sections, 4 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Left: The memory footprint required by training is much larger than inference. Right: Memory cost comparison between ResNet-50 and MobileNetV2-1.4 under batch size 16. Recent advances in efficient model design only reduce the size of parameters, but the activation size, which is the main bottleneck for training, does not improve much.
  • Figure 2: TinyTL overview ("C" denotes the width and "R" denotes the resolution). Conventional transfer learning relies on fine-tuning the weights to adapt the model (Fig. a), which requires a large amount of activation memory (in blue) for back-propagation. TinyTL reduces the memory usage by fixing the weights (Fig. b) and fine-tuning only the bias. Fig. c exploits lite residual learning to compensate for the capacity loss, using group convolution and avoiding the inverted bottleneck to achieve high arithmetic intensity and a small memory footprint. The skip connection remains unchanged (omitted for simplicity).
  • Figure 3: Top-1 accuracy results of different transfer learning methods under varied resolutions using the same pre-trained neural network (ProxylessNAS-Mobile). At the same level of accuracy, TinyTL achieves 3.9-6.5x memory saving compared to fine-tuning the full network.
  • Figure 4: Compared with dynamic activation pruning [liu2019dynamic], TinyTL saves memory more effectively.
  • Figure 5: Results of TinyTL when trained with batch size 1. It further reduces the training memory footprint to around 16MB (typical L3 cache size), making it possible to train on the cache (SRAM) instead of DRAM.
  • ...and 2 more figures
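
The captions of Figures 1 and 5 rest on the fact that activation memory grows with batch size and resolution, while parameter memory does not. The following back-of-the-envelope sketch illustrates this for one convolution layer. The layer sizes are hypothetical and are not taken from the paper; the helper names are assumptions made for illustration.

```python
# Illustrative memory accounting for one convolution layer (float32).
# All layer sizes below are hypothetical, chosen only to show the scaling.

BYTES_PER_FLOAT32 = 4

def conv_param_bytes(c_in, c_out, kernel, groups=1):
    """Weight memory of a conv layer; group convolution divides it by `groups`."""
    return (c_in // groups) * c_out * kernel * kernel * BYTES_PER_FLOAT32

def activation_bytes(batch, channels, resolution):
    """Memory of one feature map of shape (batch, channels, R, R)."""
    return batch * channels * resolution * resolution * BYTES_PER_FLOAT32

params = conv_param_bytes(c_in=64, c_out=64, kernel=3)
act_batch16 = activation_bytes(batch=16, channels=64, resolution=56)
act_batch1 = activation_bytes(batch=1, channels=64, resolution=56)

print(f"parameters:            {params / 2**20:.2f} MiB")
print(f"activation (batch 16): {act_batch16 / 2**20:.2f} MiB")
print(f"activation (batch 1):  {act_batch1 / 2**20:.2f} MiB")
```

In this toy setting the stored activation of a single layer is far larger than the layer's weights at batch size 16, and it shrinks linearly as the batch size drops to 1, whereas the parameter memory stays fixed.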