ReDistill: Residual Encoded Distillation for Peak Memory Reduction of CNNs

Fang Chen; Gourav Datta; Mujahid Al Rafi; Hyeran Jeon; Meng Tang

TL;DR

ReDistill addresses the challenge of high peak memory in CNN inference on memory-constrained edge devices by pairing aggressive initial pooling in a student network with Residual Encoded Distillation (RED) blocks that align the student’s down-sampled features with the teacher’s features. The RED block combines a gating mechanism and a residual encoder to efficiently transfer knowledge while bounding memory usage, enabling substantial memory reductions with minimal performance loss on image classification and diffusion-based image generation. Across extensive experiments, ReDistill outperforms existing KD methods in the memory-accuracy trade-off, achieving roughly four-to-fivefold peak-memory reductions for classification and about fourfold reductions for DDPM-based image generation, with practical deployment potential on edge hardware. The work provides a versatile, memory-centric distillation framework that can be integrated with existing KD strategies and quantization techniques, facilitating scalable edge deployment and paving the way for future extensions to other architectures such as vision transformers.

Abstract

The expansion of neural network sizes and the enhanced resolution of modern image sensors result in heightened memory and power demands to process modern computer vision models. In order to deploy these models in extremely resource-constrained edge devices, it is crucial to reduce their peak memory, which is the maximum memory consumed during the execution of a model. A naive approach to reducing peak memory is aggressive down-sampling of feature maps via pooling with large stride, which often results in unacceptable degradation in network performance. To mitigate this problem, we propose residual encoded distillation (ReDistill) for peak memory reduction in a teacher-student framework, in which a student network with less memory is derived from the teacher network using aggressive pooling. We apply our distillation method to multiple problems in computer vision, including image classification and diffusion-based image generation. For image classification, our method yields 4x-5x theoretical peak memory reduction with less degradation in accuracy for most CNN-based architectures. For diffusion-based image generation, our proposed distillation method yields a denoising network with 4x lower theoretical peak memory while maintaining decent diversity and fidelity for image generation. Experiments demonstrate our method's superior performance compared to other feature-based and response-based distillation methods when applied to the same student network. The code is available at https://github.com/mengtang-lab/ReDistill.
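
The effect of a larger pooling stride on peak memory can be illustrated with a small calculation. This is a minimal sketch, not the authors' measurement code. It assumes that a layer needs only its input and output activations in memory at once, and it ignores weights and skip connections. The helper names `peak_memory_elements` and `first_layers` and the toy two-tensor network are hypothetical.

```python
from math import prod


def peak_memory_elements(shapes):
    """Largest input-plus-output activation count over consecutive layers.

    `shapes` lists activation shapes (C, H, W) in execution order, starting
    with the network input. Assumption: only a layer's input and output are
    resident at the same time; weights and skip connections are ignored.
    """
    sizes = [prod(shape) for shape in shapes]
    return max(a + b for a, b in zip(sizes, sizes[1:]))


def first_layers(stride, size=224, channels=64):
    """Toy network (hypothetical): an RGB image followed by one strided conv."""
    out = size // stride
    return [(3, size, size), (channels, out, out)]


baseline = peak_memory_elements(first_layers(stride=2))
aggressive = peak_memory_elements(first_layers(stride=8))
print(baseline, aggressive, baseline / aggressive)
```

In this toy case the count drops from 953344 to 200704 activation elements, a factor of 4.75. The number only illustrates the mechanism under the stated assumptions and is not a result reported by the paper.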

Paper Structure (35 sections, 7 equations, 7 figures, 10 tables)

Introduction
Related Work
Memory-constrained deep learning
Knowledge distillation for image classification
Knowledge distillation for diffusion models
Proposed Method
Preliminaries
Proposed Distillation Framework
Residual Encoded Distillation Block
Loss Function
Distillation for Diffusion Model
Experiments
Datasets
Datasets for Image Classification
Datasets for Image Generation
...and 20 more sections

Figures (7)

Figure 1: (a) Left: For ImageNet classification, our distillation method significantly reduces the theoretical peak memory of ResNet-based models while achieving accuracy better than existing distillation methods. (b) Right: For diffusion-based image generation, our distilled network with $4{\times}$ lower theoretical peak memory generates images indistinguishable from the generated images of a teacher network.
Figure 2: Our proposed residual encoded distillation framework (ReDistill). RED blocks are incorporated into the student model following the pooling layers to minimize the discrepancy between the down-sampled features of the student and teacher models.
Figure 3: Residual Encoded Distillation (RED) Block. We use a logit module for the multiplicative gating mechanism and a residual encoder for additive residual learning.
Figure 4: ReDistill for the denoising network in DDPM (ddpm2020). We integrate RED blocks into the student model after the down-sampling layer in the encoder and before the up-sampling layer in the decoder.
Figure 5: Example of our proposed Aggressive Pooling Setting with ResNet18. All downsampling layers (conv or maxpool layers with stride larger than 1) are highlighted in red. For example, in the 'Conv 1' cell of 'T: ResNet18', 'conv $7 \times 7$: c_64, s_2' denotes a convolution layer with kernel size $7 \times 7$, 64 channels, and stride 2. In the following row, 'maxpool $3 \times 3$: s_2' denotes a max pool layer with kernel size $3 \times 3$ and stride 2. In the student network 'S: ResNet18$\times 4$', we multiply the stride of the first downsampling layer, 'Conv 1', by 4 (from s_2 to s_8) and set the stride of the maxpool layer and of the last downsampling layer (i.e., the first conv layer in Stage 4) to 1, so that the activation size before the average pool and fc layers is the same as in the teacher.
...and 2 more figures
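
The stride arrangement described in the Figure 5 caption can be checked with a short trace of spatial sizes. This is a sketch under stated assumptions: the Stage 2 and Stage 3 strides are the standard ResNet18 values (the caption does not list them), padding is assumed to make each layer's output size equal to the input size divided by the stride (rounded up), and the helper name `trace_sizes` is hypothetical.

```python
from math import ceil

# Strides at the downsampling points, in order:
# Conv 1, maxpool, Stage 2, Stage 3, Stage 4 (first conv of each stage).
TEACHER_STRIDES = [2, 2, 2, 2, 2]  # 'T: ResNet18'
STUDENT_STRIDES = [8, 1, 2, 2, 1]  # 'S: ResNet18x4', per the Figure 5 caption


def trace_sizes(input_size, strides):
    """Return the spatial size after each downsampling point.

    Assumption: padding is chosen so that output = ceil(input / stride).
    """
    sizes = [input_size]
    for stride in strides:
        sizes.append(ceil(sizes[-1] / stride))
    return sizes


teacher = trace_sizes(224, TEACHER_STRIDES)
student = trace_sizes(224, STUDENT_STRIDES)
print("teacher:", teacher)
print("student:", student)
```

For a 224-pixel input, both traces end at a spatial size of 7, so the average pool and fc layers see the same activation size in both networks. The student reaches a 28-pixel map directly after 'Conv 1', where the teacher still holds a 112-pixel map.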
