NeuroFlux: Memory-Efficient CNN Training Using Adaptive Local Learning

Dhananjay Saikumar; Blesson Varghese

TL;DR

NeuroFlux targets memory-constrained on-device CNN training by replacing end-to-end Backpropagation (BP) with adaptive local learning. It combines adaptive auxiliary networks with adaptive batch sizes in a block-based training framework, so that training fits a given memory budget on edge devices while preserving accuracy. The system caches activations, partitions the network into memory-homogeneous blocks, and selects compact early-exit CNNs. It reports 2.3×–6.1× training speed-ups over BP and final models with 10.9×–29.4× fewer parameters, along with 1.61×–3.95× inference throughput gains. This makes on-device learning more practical for privacy-preserving and personalized AI at the edge, with potential extensions to speech, transformers, and federated settings.

Abstract

Efficient on-device Convolutional Neural Network (CNN) training in resource-constrained mobile and edge environments is an open challenge. Backpropagation is the standard training approach, but it is GPU-memory intensive: its strong inter-layer dependencies require the intermediate activations of the entire CNN model to be retained in GPU memory. Fitting within the available GPU memory budget therefore forces smaller batch sizes, which in turn leads to impractically long training times. We introduce NeuroFlux, a novel CNN training system tailored for memory-constrained scenarios. It exploits two novel opportunities: first, adaptive auxiliary networks that employ a variable number of filters to reduce GPU memory usage, and second, block-specific adaptive batch sizes, which both satisfy the GPU memory constraints and accelerate training. NeuroFlux segments a CNN into blocks based on GPU memory usage and attaches an auxiliary network to each layer in these blocks. This breaks the usual inter-layer dependencies under a new training paradigm, 'adaptive local learning'. NeuroFlux also caches intermediate activations, eliminating redundant forward passes over previously trained blocks and further accelerating training. Compared to Backpropagation, the results are twofold: on various hardware platforms, NeuroFlux achieves training speed-ups of 2.3× to 6.1× under stringent GPU memory budgets, and it generates streamlined models that have 10.9× to 29.4× fewer parameters.
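As a rough illustration of how block-specific batch sizes could follow from a GPU memory budget, the sketch below picks the largest batch size each layer can afford and groups consecutive layers with similar affordable batch sizes into one block. It is a minimal sketch under stated assumptions, not the NeuroFlux algorithm: the `LayerCost` profile, the linear memory model (fixed cost plus per-sample activation cost), the `tolerance` grouping rule, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LayerCost:
    """Hypothetical per-layer memory profile, in MiB (illustrative only)."""
    name: str
    fixed_mib: float       # weights + optimizer state + auxiliary network
    per_sample_mib: float  # activation memory for one training sample


def max_batch_size(layer: LayerCost, budget_mib: float) -> int:
    """Largest batch size whose estimated footprint fits within the budget."""
    free = budget_mib - layer.fixed_mib
    if free <= 0:
        return 0  # the layer does not fit at all under this budget
    return int(free // layer.per_sample_mib)


def partition_into_blocks(
    layers: List[LayerCost], budget_mib: float, tolerance: float = 0.25
) -> List[Tuple[List[str], int]]:
    """Group consecutive layers with similar affordable batch sizes.

    Assumed rule: a layer joins the current block while the smallest batch
    size in the block stays within `tolerance` of the largest one. Each
    block trains with the smallest batch size among its layers.
    """
    names: List[List[str]] = []
    sizes: List[List[int]] = []
    for layer in layers:
        b = max_batch_size(layer, budget_mib)
        if b == 0:
            raise ValueError(f"{layer.name} does not fit in {budget_mib} MiB")
        if sizes:
            candidate = sizes[-1] + [b]
            if min(candidate) >= (1 - tolerance) * max(candidate):
                names[-1].append(layer.name)
                sizes[-1].append(b)
                continue
        names.append([layer.name])
        sizes.append([b])
    return [(n, min(s)) for n, s in zip(names, sizes)]


# Made-up profile: deeper layers hold smaller activations per sample.
LAYERS = [
    LayerCost("conv1", fixed_mib=20.0, per_sample_mib=8.0),
    LayerCost("conv2", fixed_mib=30.0, per_sample_mib=7.5),
    LayerCost("conv3", fixed_mib=40.0, per_sample_mib=4.0),
    LayerCost("conv4", fixed_mib=60.0, per_sample_mib=2.0),
    LayerCost("conv5", fixed_mib=60.0, per_sample_mib=2.2),
]

plan = partition_into_blocks(LAYERS, budget_mib=300.0)
for block_layers, batch_size in plan:
    print(block_layers, "-> batch size", batch_size)
```

With this made-up profile, later blocks can afford larger batches than earlier ones, which is the effect that block-specific batch sizes are meant to exploit. A real system would measure memory on the device rather than rely on a linear model.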

Paper Structure (27 sections, 12 equations, 14 figures, 3 tables, 2 algorithms)

This paper contains 27 sections, 12 equations, 14 figures, 3 tables, 2 algorithms.

Introduction
Background and Motivation
On-device Training
Backpropagation is Memory Intensive
Training Paradigms Beyond Backpropagation
The Case for Adaptive Local Learning
NeuroFlux Overview
Design
Representing Local Learning
Modules
Efficient Forward Propagation via Prefetching and Adaptive Batching
Trained Output CNN
Evaluation
Experimental Setup
End-to-End Training Performance
...and 12 more sections

Figures (14)

Figure 1: Comparison of GPU memory usage and relative training time for different architectures and batch sizes on the Tiny ImageNet dataset. The top row shows memory used by activations, the model, and the optimizer, with multipliers indicating memory required relative to inference. The bottom row shows training time relative to a batch size of 256.
Figure 2: Comparison of BP and LL. BP relies on a global loss, with updates for each layer dependent on subsequent layers. In contrast, LL pairs each layer (excluding the last) with an auxiliary network for independent updates using local losses, thereby eliminating backward feedback dependencies.
Figure 3: GPU memory required and accuracy achieved by different training paradigms. The blue-shaded quadrant represents the ideal zone for a training paradigm (low GPU memory utilization and high accuracy).
Figure 4: GPU memory usage of VGG-19 for inference, Backpropagation (BP), classic Local Learning (LL) with a constant number of 256 convolutional filters (LL_Vanilla), and the proposed Adaptive Auxiliary Networks-based LL (AAN-LL) for different batch sizes.
Figure 5: GPU memory usage for training VGG-19 with a batch size of 30 images using AAN-LL. 'Unused Memory' area refers to GPU memory not utilized by each layer.
...and 9 more figures
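
The contrast between BP and LL described in the Figure 2 caption can be written out as a short worked formulation. The notation here is an assumption made for illustration and is not taken from the paper: $f_\ell$ with parameters $\theta_\ell$ is layer $\ell$ of an $L$-layer CNN, $h_\ell$ is its output, $a_\ell$ with parameters $\phi_\ell$ is the auxiliary network attached to layer $\ell$, $\mathrm{sg}(\cdot)$ denotes stop-gradient, and $\eta$ is the learning rate.

```latex
% Backpropagation: one global loss; the gradient for layer \ell
% passes through every subsequent layer.
\begin{align}
h_\ell &= f_\ell(h_{\ell-1}; \theta_\ell), \qquad h_0 = x, \\
\frac{\partial \mathcal{L}(h_L, y)}{\partial \theta_\ell}
  &= \frac{\partial \mathcal{L}}{\partial h_L}\,
     \frac{\partial h_L}{\partial h_{L-1}} \cdots
     \frac{\partial h_{\ell+1}}{\partial h_\ell}\,
     \frac{\partial h_\ell}{\partial \theta_\ell}.
\end{align}

% Local learning: each layer except the last has its own local loss,
% computed through its auxiliary network; no gradient flows between layers.
\begin{align}
h_\ell &= f_\ell\big(\mathrm{sg}(h_{\ell-1}); \theta_\ell\big), \\
\mathcal{L}_\ell &= \mathcal{L}\big(a_\ell(h_\ell; \phi_\ell),\, y\big),
  \qquad \ell = 1, \dots, L-1, \\
(\theta_\ell, \phi_\ell) &\leftarrow (\theta_\ell, \phi_\ell)
  - \eta\, \nabla_{(\theta_\ell, \phi_\ell)} \mathcal{L}_\ell .
\end{align}
```

Under this formulation the update for layer $\ell$ treats $h_{\ell-1}$ as a constant input, so the backward pass for that layer does not require the activations of any other layer to be held in memory. The last layer needs no auxiliary network because its loss is computed on the network's own output.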
