Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy

Asit Mishra, Debbie Marr

TL;DR

This work tackles the accuracy drop of low-precision neural networks by combining knowledge distillation with quantization in the Apprentice framework. It introduces three schemes (joint training, distillation from a trained teacher, and fine-tuning the student) to produce high-accuracy low-precision variants of ResNet on ImageNet. The approach achieves state-of-the-art results for ternary and 4-bit networks and demonstrates faster convergence in some schemes, with CIFAR-10 results supporting generality. The findings suggest that distillation can recover much of the accuracy lost to aggressive quantization, enabling practical edge and cloud deployments with reduced compute and memory footprints.

Abstract

Deep learning networks have achieved state-of-the-art accuracies on computer vision workloads like image classification and object detection. The best-performing systems, however, typically involve big models with numerous parameters. Once trained, such top-performing models are challenging to deploy on resource-constrained inference systems: the models (often deep networks, wide networks, or both) are compute and memory intensive. Low-precision numerics and model compression using knowledge distillation are popular techniques to lower both the compute requirements and memory footprint of these deployed models. In this paper, we study the combination of these two techniques and show that the performance of low-precision networks can be significantly improved by using knowledge distillation techniques. Our approach, Apprentice, achieves state-of-the-art accuracies using ternary precision and 4-bit precision for variants of the ResNet architecture on the ImageNet dataset. We present three schemes by which one can apply knowledge distillation techniques to various stages of the train-and-deploy pipeline.
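
To make the combination of the two techniques concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes a generic temperature-softened distillation loss and a simple threshold-based ternary quantizer. The function names, the `alpha` and `temperature` parameters, and the 0.7 threshold ratio are illustrative assumptions rather than values taken from this work.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax with an optional softening temperature.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    # Cross-entropy between a target distribution and a predicted one.
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def ternarize(weights, threshold_ratio=0.7):
    # Hypothetical quantizer: map each weight to {-scale, 0, +scale}.
    # The threshold is a fraction of the mean absolute weight (an assumption).
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    threshold = threshold_ratio * mean_abs
    kept = [abs(w) for w in weights if abs(w) > threshold]
    scale = sum(kept) / len(kept) if kept else 0.0
    return [scale * (1 if w > threshold else -1 if w < -threshold else 0)
            for w in weights]

def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    # Weighted sum of (a) cross-entropy against the ground-truth label and
    # (b) cross-entropy against the teacher's softened output distribution.
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(one_hot, softmax(student_logits))
    soft = cross_entropy(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return (1.0 - alpha) * hard + alpha * soft

if __name__ == "__main__":
    print(ternarize([0.9, -0.8, 0.05, -0.02]))
    print(distillation_loss([0.1, 3.0, 0.2], [0.0, 4.0, 0.1], label=1))
```

In a real training loop the student's forward pass would use the quantized weights while the teacher stays at full precision. This sketch only shows the two ingredients in isolation.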

Paper Structure

This paper contains 12 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Memory footprint of activations (ACTs) and weights (W) during inference for mini-batch sizes 1 and 8.
  • Figure 2: Schematic of the knowledge distillation setup. The teacher network is a high precision network and the apprentice network is a low-precision network.
  • Figure 3: Difference in Top-1 error rate for low-precision variants of ResNet-18 with (blue bars) and without (red bars) distillation scheme. The difference is calculated from the accuracy of ResNet-18 with full-precision numerics. Higher % difference denotes a better network configuration.
  • Figure 4: Difference in Top-1 error rate for low-precision variants of ResNet-34 and ResNet-50 with (blue bars) and without (red bars) distillation scheme. The difference is calculated from the accuracy of the baseline network (ResNet-34 for (a) and ResNet-50 for (b)) operating at full-precision. Higher % difference denotes a better network configuration.
  • Figure 5: Top-1 error rate versus epochs of four student networks using scheme-A and scheme-B.
  • ...and 1 more figure