Coded Deep Learning: Framework and Algorithm

En-hui Yang, Shayan Mohajer Hamidi

TL;DR

CDL is a framework that brings information-theoretic coding into deep learning through trainable probabilistic quantizers with conditional probability mass functions (CPMFs) $P_{\alpha}(\cdot|\theta)$ and entropy constraints $H(\cdot)$, enabling quantized forward/backward passes and compressible weights/activations during training. The method yields a softened gradient path via the soft deterministic quantizer $\mathsf{Q}_{\rm d}(\cdot)$ and an entropy-regularized objective that minimizes both the cross-entropy loss and the description lengths of weights and activations. A relaxed variant, R-CDL, uses $\mathsf{Q}_{\rm d}(\cdot)$ for gradients while keeping the forward/backward passes in full precision, delivering better accuracy–compression trade-offs. Across CIFAR-100 and ImageNet with ResNet variants, CDL and R-CDL outperform existing QAT baselines at similar or lower bit regimes, while enabling highly compressible models via Huffman coding. The approach reduces training/inference complexity and communication costs in model/data parallelism, with the trained model stored in a quantized, compressible format.
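To make the link between entropy constraints and Huffman-coded storage concrete, the sketch below measures the empirical entropy of a list of quantization indices and the average Huffman codeword length for the same list. It is only an illustration: the function names and the toy index list are hypothetical, and the paper's actual entropy term and coding pipeline are not reproduced here.

```python
import heapq
import math
from collections import Counter


def empirical_entropy(symbols):
    """Empirical entropy (bits per symbol) of a sequence of quantization indices."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def huffman_code_lengths(symbols):
    """Codeword length per distinct symbol for a binary Huffman code."""
    counts = Counter(symbols)
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, symbols in this subtree).
    heap = [(c, i, [s]) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol below them.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths


def average_bits(symbols):
    """Average Huffman codeword length (bits per symbol) over the sequence."""
    counts = Counter(symbols)
    lengths = huffman_code_lengths(symbols)
    return sum(counts[s] * lengths[s] for s in counts) / len(symbols)


# Toy index list (hypothetical): a peaked distribution compresses well.
indices = [0] * 8 + [1] * 4 + [-1] * 2 + [2] * 2
print(empirical_entropy(indices), average_bits(indices))
```

The average Huffman length is always at least the empirical entropy and less than one bit above it, so lowering the entropy of the quantized values during training directly lowers the achievable storage cost.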

Abstract

The success of deep learning (DL) is often achieved with large models and high complexity during both training and post-training inference, hindering training in resource-limited settings. To alleviate these issues, this paper introduces a new framework dubbed "coded deep learning" (CDL), which integrates information-theoretic coding concepts into the inner workings of DL to significantly compress model weights and activations, reduce computational complexity at both the training and post-training inference stages, and enable efficient model/data parallelism. Specifically, within CDL, (i) we first propose a novel probabilistic method for quantizing both model weights and activations, together with its soft differentiable variant, which offers an analytic formula for gradient calculation during training; (ii) both the forward and backward passes during training are executed over quantized weights and activations, eliminating most floating-point operations and reducing training complexity; (iii) during training, both weights and activations are entropy constrained so that they are compressible in an information-theoretic sense throughout training, thus reducing communication costs in model/data parallelism; and (iv) the trained model in CDL is by default in a quantized format with compressible quantized weights, reducing post-training inference and storage complexity. Additionally, a variant of CDL, namely relaxed CDL (R-CDL), is presented to further improve the trade-off between validation accuracy and compression; it requires full precision in training but keeps the other advantageous features of CDL intact. Extensive empirical results show that CDL and R-CDL outperform state-of-the-art DNN compression algorithms in the literature.
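The probabilistic quantizer and its soft differentiable variant in item (i) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the signed $b$-bit index set, the step size `q`, and the softmax-style CPMF $P_{\alpha}(i|\theta) \propto \exp(-\alpha(\theta - iq)^2)$ are assumed forms, and all function names are hypothetical. Here `q_p` samples a quantized value, `q_d` returns its conditional mean, and `q_u` is an ordinary uniform quantizer for comparison.

```python
import math
import random


def index_set(b):
    """Signed b-bit index set {-2^(b-1), ..., 2^(b-1) - 1} (assumed convention)."""
    half = 2 ** (b - 1)
    return list(range(-half, half))


def cpmf(theta, alpha, q, b):
    """Assumed CPMF: P_alpha(i | theta) proportional to exp(-alpha * (theta - i*q)^2)."""
    idx = index_set(b)
    logits = [-alpha * (theta - i * q) ** 2 for i in idx]
    m = max(logits)  # subtract the max logit for numerical stability
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    return idx, [x / z for x in w]


def q_p(theta, alpha, q, b, rng=random):
    """Probabilistic quantizer: sample index i with probability P_alpha(i | theta)."""
    idx, p = cpmf(theta, alpha, q, b)
    i = rng.choices(idx, weights=p, k=1)[0]
    return i * q


def q_d(theta, alpha, q, b):
    """Soft deterministic quantizer: conditional mean of q_p given theta."""
    idx, p = cpmf(theta, alpha, q, b)
    return sum(i * q * pi for i, pi in zip(idx, p))


def q_u(theta, q, b):
    """Plain uniform quantizer: round to the nearest level, then clamp."""
    half = 2 ** (b - 1)
    i = min(max(round(theta / q), -half), half - 1)
    return i * q


# Larger alpha concentrates the CPMF on the nearest level, so q_d approaches q_u.
for alpha in (100, 300, 500, 700):
    print(alpha, q_d(0.26, alpha, q=0.1, b=3), q_u(0.26, q=0.1, b=3))
```

Because `q_d` is a smooth function of `theta`, `alpha`, and `q`, it can supply gradients where a hard rounding step cannot; `alpha` controls how closely the soft quantizer tracks the hard one.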

Paper Structure

This paper contains 19 sections, 2 theorems, 38 equations, 5 figures, 1 table, 1 algorithm.

Key Result

Proposition 1

For any $\theta$ and $\alpha > 0$, where $\text{Var} \{ \mathsf{Q}_{\rm p}(\theta) ~|~\theta \}$ is the conditional variance of $\mathsf{Q}_{\rm p}(\theta)$ given $\theta$.
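The displayed equation of the proposition is missing from this extract, so the following is not a restatement of it. It is a numerical check of one standard identity that links a softmax-weighted mean to its variance, under the assumption that the CPMF has the form $P_{\alpha}(i|\theta) \propto \exp(-\alpha(\theta - iq)^2)$ over a signed $b$-bit index set. Under that assumed form, differentiating the conditional mean $\mathsf{Q}_{\rm d}(\theta)$ gives $2\alpha \, \text{Var} \{ \mathsf{Q}_{\rm p}(\theta) ~|~\theta \}$, which the sketch compares against a central finite difference. All function names are hypothetical.

```python
import math


def levels(q, b):
    """Quantization levels i*q for a signed b-bit index set (assumed convention)."""
    half = 2 ** (b - 1)
    return [i * q for i in range(-half, half)]


def cpmf(theta, alpha, q, b):
    """Assumed CPMF: weights proportional to exp(-alpha * (theta - level)^2)."""
    xs = levels(q, b)
    logits = [-alpha * (theta - x) ** 2 for x in xs]
    m = max(logits)  # subtract the max logit for numerical stability
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    return xs, [v / z for v in w]


def soft_quantize(theta, alpha, q, b):
    """Conditional mean of the probabilistic quantizer output given theta."""
    xs, p = cpmf(theta, alpha, q, b)
    return sum(x * pi for x, pi in zip(xs, p))


def conditional_variance(theta, alpha, q, b):
    """Conditional variance of the probabilistic quantizer output given theta."""
    xs, p = cpmf(theta, alpha, q, b)
    mean = sum(x * pi for x, pi in zip(xs, p))
    return sum(pi * (x - mean) ** 2 for x, pi in zip(xs, p))


def finite_difference(theta, alpha, q, b, h=1e-6):
    """Central finite-difference estimate of d soft_quantize / d theta."""
    up = soft_quantize(theta + h, alpha, q, b)
    down = soft_quantize(theta - h, alpha, q, b)
    return (up - down) / (2 * h)


theta, alpha, q, b = 0.13, 100.0, 0.1, 3
print(finite_difference(theta, alpha, q, b))
print(2 * alpha * conditional_variance(theta, alpha, q, b))
```

The two printed numbers should agree to within finite-difference error. One consequence of the identity, again under the assumed CPMF, is that the gradient through the soft quantizer is never negative and vanishes only when the CPMF collapses onto a single level.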

Figures (5)

  • Figure 1: Partial derivatives of $\mathsf{Q}_{\rm d}(\theta)$ w.r.t. $\theta$ (left) and $q$ (right) for $\alpha \in \{100, 300, 500, 700\}$, with $b = 3$ and $q = 0.1$.
  • Figure 2: $\mathsf{Q}_{\rm u}(\cdot)$ vs. $\mathsf{Q}_{\rm d}(\cdot)$ for $\alpha \in \{100, 300, 500, 700\}$, with $b = 3$ and $q = 0.1$.
  • Figure 3: Illustration of CDL's mechanism.
  • Figure 4: Comparison of models trained by CDL, R-CDL, and benchmark methods in terms of the Top-1 accuracy vs the average number of bits per weight (top)/activation (bottom) on ImageNet: (a) ResNet-18, and (b) ResNet-34. All models are trained from scratch.

Theorems & Definitions (6)

  • Remark
  • Proposition 1
  • Proof
  • Lemma 1
  • Remark
  • Remark