Convex Distillation: Efficient Compression of Deep Networks via Convex Optimization

Prateek Varshney, Mert Pilanci

TL;DR

The paper shows that convex neural networks, when given rich feature representations from a large pre-trained non-convex model, can achieve performance comparable to their non-convex counterparts, opening up avenues for future research at the intersection of convex optimization and deep learning.

Abstract

Deploying large and complex deep neural networks on resource-constrained edge devices poses significant challenges due to their computational demands and the complexities of non-convex optimization. Traditional compression methods such as distillation and pruning often retain non-convexity, which complicates real-time fine-tuning on such devices. Moreover, these methods often necessitate extensive end-to-end network fine-tuning after compression to preserve model performance, which is not only time-consuming but also requires fully annotated datasets, thus potentially negating the benefits of efficient network compression. In this paper, we introduce a novel distillation technique that efficiently compresses the model via convex optimization, eliminating intermediate non-convex activation functions and using only intermediate activations from the original model. Our approach enables distillation in a label-free data setting and achieves performance comparable to the original model without requiring any post-compression fine-tuning. We demonstrate the effectiveness of our method for image classification models on multiple standard datasets, and further show that in the data-limited regime, our method can outperform standard non-convex distillation approaches. Our method promises significant advantages for deploying high-efficiency, low-footprint models on edge devices, making it a practical choice for real-world applications. We show that convex neural networks, when provided with rich feature representations from a large pre-trained non-convex model, can achieve performance comparable to their non-convex counterparts, opening up avenues for future research at the intersection of convex optimization and deep learning.

Paper Structure

This paper contains 21 sections, 3 theorems, 18 equations, 7 figures, 2 tables.

Key Result

Theorem 1

Let $\mathbf{X} \in \mathbb{R}^{n\times d}$ be a data matrix and $\mathbf{y} \in \mathbb{R}^n$ the associated scalar targets. The two-layer ReLU neural network can then be expressed as $f(\mathbf{X}) = \mathsf{ReLU}(\mathbf{X}\mathbf{W}_1^\top)\,\mathbf{w}_2$, where $\mathbf{W}_1 \in \mathbb{R}^{m\times d}$, $\mathbf{w}_2 \in \mathbb{R}^m$ are the weights of the first and second layers, $m$ is the number of hidden units, and $\mathsf{ReLU}(\cdot)$ is the ReLU activation
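
The shapes in the theorem statement pin down the forward pass: with $\mathbf{X}$ of shape $(n, d)$, $\mathbf{W}_1$ of shape $(m, d)$, and $\mathbf{w}_2$ of length $m$, the network output is a length-$n$ vector. The following is a minimal NumPy sketch of that computation; the function name `two_layer_relu` and the random example values are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def two_layer_relu(X, W1, w2):
    """Two-layer ReLU network: ReLU(X W1^T) w2.

    X  : (n, d) data matrix
    W1 : (m, d) first-layer weights, one row per hidden unit
    w2 : (m,)   second-layer weights
    Returns an (n,) vector of scalar predictions.
    """
    hidden = np.maximum(X @ W1.T, 0.0)  # (n, m) hidden activations
    return hidden @ w2                  # (n,) network outputs


# Illustrative shapes only; the values are random.
rng = np.random.default_rng(0)
n, d, m = 5, 3, 4
X = rng.standard_normal((n, d))
W1 = rng.standard_normal((m, d))
w2 = rng.standard_normal(m)
preds = two_layer_relu(X, W1, w2)
```

The non-convexity the theorem addresses comes from training $\mathbf{W}_1$ and $\mathbf{w}_2$ jointly through the ReLU; the forward pass itself is unchanged by the convex reformulation.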

Figures (7)

  • Figure 1: For the Resnet18 architecture, we first distill Block4 by training our convex block (orange) on a dataset of input-output activations. After training, we simply swap out the existing non-convex block and replace it with our convex block. All other layers are kept frozen (marked in purple).
  • Figure 2: To improve $\mathsf{SCNN}$'s one-vs-all solution, we freeze $\mathbf{W}_1^\star$ but recompute $\mathbf{W}_2^\star$ for Equation \ref{eq:activation-distillation} by enforcing information sharing (red lines) across the constituent $\mathbf{W}_{1i}$'s.
  • Figure 3: $\mathsf{S}_{\text{convex}}$ vs. $\mathsf{S}_{\text{non-convex}}$ performance comparisons in low-sample and high-compression regimes.
  • Figure 4: Performance comparisons of all three distillation methods on Blocks 3 and 4, and their combinations, of the Resnet18 model on CIFAR10. The black dotted line denotes the original fine-tuned model's performance on CIFAR10. In the leftmost subplot, we distill only Block 3, in the middle subplot only Block 4, and in the right subplot, we plug and play different combinations of the compressed blocks into the original model.
  • Figure 5: Comparison of different optimization routines when distilling the Block 4 + Classification Head for a binary classification task on TinyImagenet.
  • ...and 2 more figures
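
The block-swap idea in Figure 1 (record a frozen block's input and output activations, fit a convex replacement on those pairs, no labels needed) can be illustrated on a toy problem. The sketch below is not the paper's SCNN-based procedure: `teacher_block`, `gated_features`, and `fit_convex_block` are hypothetical names, the gates are random and fixed, and a ridge penalty is swapped in for the group-sparsity regularizer usually used with convex networks so that the fit has a closed form.

```python
import numpy as np

rng = np.random.default_rng(0)


def teacher_block(H, A, B):
    """Stand-in for a frozen non-convex block: a ReLU layer then a linear map."""
    return np.maximum(H @ A, 0.0) @ B


def gated_features(H, G):
    """Concatenate D_i H over fixed gates D_i = diag(1[H g_i >= 0]).

    H : (n, d) input activations, G : (d, p) gate vectors.
    Returns an (n, p * d) feature matrix; the student is linear in it.
    """
    masks = (H @ G >= 0).astype(H.dtype)          # (n, p) gate patterns
    feats = masks[:, :, None] * H[:, None, :]     # (n, p, d)
    return feats.reshape(H.shape[0], -1)


def fit_convex_block(H, T, G, ridge=1e-3):
    """Ridge regression on gated features (assumption: ridge, not group lasso)."""
    Phi = gated_features(H, G)
    gram = Phi.T @ Phi + ridge * np.eye(Phi.shape[1])
    return np.linalg.solve(gram, Phi.T @ T)       # (p * d, c) student weights


n, d, c, p = 400, 8, 3, 16
A = rng.standard_normal((d, 32))    # frozen teacher weights
B = rng.standard_normal((32, c))
G = rng.standard_normal((d, p))     # random gate vectors, kept fixed

# Label-free distillation data: inputs to the block and the block's own outputs.
H_train = rng.standard_normal((n, d))
T_train = teacher_block(H_train, A, B)
U = fit_convex_block(H_train, T_train, G)

# "Swap" the block: evaluate the student where the teacher used to run.
H_test = rng.standard_normal((200, d))
student = gated_features(H_test, G) @ U
teacher = teacher_block(H_test, A, B)
rel_err = np.linalg.norm(student - teacher) / np.linalg.norm(teacher)
```

Because the gates are fixed, the student is linear in its trainable weights, so the fitting problem is convex and has a unique minimizer under the ridge penalty. The quantity `rel_err` measures how closely the swapped-in block tracks the teacher on held-out activations; its value depends on the toy setup and says nothing about the paper's results.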

Theorems & Definitions (3)

  • Theorem 1: Convex equivalence for ReLU networks
  • Theorem 2: Convex equivalence for GReLU networks
  • Theorem 3: Convex equivalence for vector output networks
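
For orientation, the convex equivalence for two-layer ReLU networks is commonly stated in the convex-reformulation literature (Pilanci and Ergen, 2020) in the form below. This summary does not reproduce the paper's own statement of Theorem 1, so the squared loss, the regularization weight $\beta$, and the symbols $\mathbf{v}_i$, $\mathbf{u}_i$, $\mathbf{D}_i$, $\mathbf{g}_i$, and $P$ are assumptions made here for illustration.

```latex
% D_i enumerates the distinct ReLU activation patterns on the data:
%   D_i = diag( 1[ X g_i >= 0 ] ),  i = 1, ..., P.
\begin{aligned}
\min_{\{\mathbf{v}_i, \mathbf{u}_i\}_{i=1}^{P}} \quad
  & \frac{1}{2} \Big\| \sum_{i=1}^{P} \mathbf{D}_i \mathbf{X} (\mathbf{v}_i - \mathbf{u}_i) - \mathbf{y} \Big\|_2^2
    + \beta \sum_{i=1}^{P} \big( \|\mathbf{v}_i\|_2 + \|\mathbf{u}_i\|_2 \big) \\
\text{s.t.} \quad
  & (2\mathbf{D}_i - \mathbf{I}_n)\, \mathbf{X} \mathbf{v}_i \ge 0, \qquad
    (2\mathbf{D}_i - \mathbf{I}_n)\, \mathbf{X} \mathbf{u}_i \ge 0, \qquad i = 1, \dots, P.
\end{aligned}
```

In this form the objective is a group-lasso regularized least-squares problem and the constraints are linear, so the program is convex. The gated ReLU (GReLU) variant named in Theorem 2 is usually described as dropping the cone constraints, which leaves an unconstrained problem.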