Continual Deep Learning on the Edge via Stochastic Local Competition among Subnetworks

Theodoros Christophides, Kyriakos Tolias, Sotirios Chatzis

TL;DR

The paper tackles continual learning on resource-constrained edge devices by introducing TWTA-CIL, a stochastic local competition mechanism that partitions each layer into $I$ blocks of $J$ competing units to produce task-specific sparse representations. A per-task posterior over block winners determines which units, and which of their weights, are updated; a Gumbel-Softmax relaxation keeps training differentiable. At inference, each block keeps only its winning unit, which yields a task-specific ticket and reduces both memory and FLOPs. The approach includes a convolutional variant (Conv-TWTA) and trains in a single cycle per task. It reports strong accuracy with far smaller deployment footprints than baselines such as SparCL, LLT, and WSN, which supports its practicality for real-world continual learning on the edge.

Abstract

Continual learning on edge devices poses unique challenges due to stringent resource constraints. This paper introduces a novel method that leverages stochastic competition principles to promote sparsity, significantly reducing the memory footprint and computational demand of deep networks. Specifically, we propose deep networks composed of blocks of units that compete locally to win the representation of each newly arising task; the competition is stochastic. This network organization yields sparse task-specific representations at each network layer; the sparsity pattern is learned during training and differs among tasks. Crucially, our method sparsifies both the weights and the weight gradients, thus facilitating training on edge devices. The sparsification is based on the winning probability of each unit in a block. During inference, the network retains only the winning unit of each block and zeroes out all weights pertaining to non-winning units for the task at hand. Thus, our approach is specifically tailored for deployment on edge devices, providing an efficient and scalable solution for continual learning in resource-limited environments.
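
The block-wise competition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`gumbel_softmax`, `twta_forward`), the array shapes, the temperature value, and the choice to model the per-task winner posterior as one input-independent logit vector per block are all assumptions made for illustration.

```python
import numpy as np


def gumbel_softmax(logits, temperature, rng):
    """Draw a relaxed one-hot sample over the last axis (Gumbel-Softmax)."""
    # Uniform noise in [1e-9, 1) keeps both logarithms finite.
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(u))
    scaled = (logits + gumbel_noise) / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)


def twta_forward(x, weights, winner_logits, temperature=0.67, rng=None, train=True):
    """Forward pass of one hypothetical TWTA-style dense layer for a single task.

    x:             (E,)       layer input
    weights:       (I, J, E)  I blocks, J competing units per block
    winner_logits: (I, J)     assumed per-task logits of the winner posterior
    returns:       (I, J)     unit outputs, gated by the winner indicator
    """
    pre_activation = weights @ x  # (I, J, E) @ (E,) -> (I, J)
    if train:
        # Training: relaxed, differentiable winner indicator per block.
        rng = np.random.default_rng() if rng is None else rng
        indicator = gumbel_softmax(winner_logits, temperature, rng)
    else:
        # Inference: hard one-hot indicator; only the most probable unit survives.
        num_units = winner_logits.shape[-1]
        indicator = np.eye(num_units)[winner_logits.argmax(axis=-1)]
    return indicator * pre_activation
```

Because each unit's output is multiplied by its entry in the winner indicator, units with a low winning probability contribute little to the layer output and receive correspondingly small gradients under backpropagation. How the method thresholds those gradients is not stated in this summary.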

Paper Structure (18 sections, 5 equations, 3 figures, 6 tables)

Figures (3)

  • Figure 1: A detailed graphical illustration of the $i$-th block of a proposed TWTA layer (Section \ref{model_form}); for demonstration purposes, we choose $J=2$ competing units per block. Inputs $\bm{x}^{(t)}=\{x_{1}^{(t)},\dots,x_{E}^{(t)}\}$ are presented to each unit in the $i$-th block when training on task $t$. Due to the TWTA mechanism, during forward passes through the network, only one competing unit propagates its output to the next layer; the rest are zeroed out.
  • Figure 2: A detailed graphical illustration of a TWTA layer (Section \ref{model_form}); for demonstration purposes, we choose $I=2$ blocks with $J=3$ competing units per block. Inputs $\bm{x}^{(t)}=\{x_{1}^{(t)},\dots,x_{E}^{(t)}\}$ are presented to each unit in all blocks when training on task $t$.
  • Figure 3: The convolutional TWTA variant (Section \ref{conv_var}); for demonstration purposes, we choose $J=2$ competing feature maps per kernel. Due to the TWTA mechanism, during forward passes through the network, only one competing feature map propagates its output to the next layer; the rest are zeroed out.
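
To make the inference-time pruning concrete for the Figure 2 configuration ($I=2$ blocks, $J=3$ units per block), the following sketch keeps only the winning unit's weights in each block. The function name `extract_ticket`, the input size $E=4$, and the example winner probabilities are hypothetical and chosen only for illustration; this is not the authors' code.

```python
import numpy as np


def extract_ticket(weights, winner_probs):
    """Keep only the winning unit's weights in every block.

    weights:      (I, J, E)  full layer weights
    winner_probs: (I, J)     assumed per-task winning probabilities
    returns:      winners (I,), compact weights (I, E)
    """
    winners = winner_probs.argmax(axis=-1)
    block_index = np.arange(weights.shape[0])
    # Integer-array indexing picks one unit (row of length E) per block.
    return winners, weights[block_index, winners]


# Figure 2 configuration: I=2 blocks, J=3 units; E=4 inputs is an assumption.
weights = np.arange(24, dtype=float).reshape(2, 3, 4)
winner_probs = np.array([[0.1, 0.7, 0.2],
                         [0.6, 0.3, 0.1]])

winners, ticket = extract_ticket(weights, winner_probs)
print(winners)                    # [1 0]
print(ticket.shape)               # (2, 4)
print(weights.size, ticket.size)  # 24 8
```

In this toy layer the task-specific ticket stores 8 of the 24 weights, a reduction by the factor $J=3$, since each block retains one of its $J$ units.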