MI-to-Mid Distilled Compression (M2M-DC): An Hybrid-Information-Guided-Block Pruning with Progressive Inner Slicing Approach to Model Compression

Lionel Levine, Haniyeh Ehsani Oskouie, Sajjad Ghiasvand, Majid Sarrafzadeh

TL;DR

M2M-DC presents a two-scale, information-guided compression framework that combines label-aware block-level mutual information pruning with planes-aware, residual-safe inner slicing and a brief staged knowledge distillation schedule. The method targets both inter-block redundancy and within-stage channel redundancy, while preserving residual geometry and recalibrating BatchNorm to maintain stable optimization. Empirical results on CIFAR-100 across ResNet-18, ResNet-34, and MobileNetV2 demonstrate that the resulting students match or exceed teacher accuracy at a fraction of the parameters and compute, establishing a strong accuracy–compute Pareto frontier. The approach is architecture-agnostic within residual and inverted-residual families and offers a practical, deployment-friendly recipe for compact models with robust performance. KD plays a key role in recovery after aggressive edits, and the method lays out a concrete path to broader hardware metrics, finer selection strategies, and transfer to larger datasets and tasks.
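As a rough illustration of the distillation step mentioned above, the sketch below shows a standard temperature-softened KD objective for a single example. The summary does not give the loss form, temperature, or mixing weight, so the Hinton-style formulation and the `temperature` and `alpha` defaults here are assumptions, not the authors' settings.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.7):
    """Single-example KD loss: soft KL term plus hard cross-entropy term.

    temperature and alpha are illustrative defaults (assumed, not from the paper).
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on temperature-softened distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Cross-entropy of the student against the ground-truth label (T = 1).
    ce = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps the soft-term gradient scale comparable across temperatures.
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce

teacher = [2.0, 0.5, -1.0]
matched = kd_loss(teacher, teacher, label=0)             # student agrees with teacher
mismatched = kd_loss([-1.0, 0.5, 2.0], teacher, label=0)  # student disagrees
```

When the student reproduces the teacher's logits, the KL term vanishes and only the weighted cross-entropy remains; a student that disagrees pays on both terms.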

Abstract

We introduce MI-to-Mid Distilled Compression (M2M-DC), a two-scale, shape-safe compression framework that interleaves information-guided block pruning with progressive inner slicing and staged knowledge distillation (KD). First, M2M-DC ranks residual (or inverted-residual) blocks by a label-aware mutual information (MI) signal and removes the least informative ones, a structured prune-after-training step. It then alternates short KD phases with stage-coherent, residual-safe channel slicing: (i) stage "planes" slicing, which co-slices conv2 out-channels with the downsample path and next-stage inputs, and (ii) an optional mid-channel trim (conv1 out / bn1 / conv2 in). Together, these steps target complementary forms of redundancy (whole computational motifs and within-stage width) while preserving residual shape invariants. On CIFAR-100, M2M-DC yields a clean accuracy-compute frontier. For ResNet-18, we obtain 85.46% Top-1 with 3.09M parameters and 0.0139 GMacs (reductions of 72% in params and 63% in GMacs vs. the teacher; mean final 85.29% over three seeds). For ResNet-34, we reach 85.02% Top-1 with 5.46M params and 0.0195 GMacs (reductions of 74% / 74% vs. the teacher; mean final 84.62%). Extending to inverted residuals, MobileNetV2 achieves a mean final 68.54% Top-1 with 1.71M params (a 27% reduction) and 0.0186 conv GMacs (a 24% reduction), improving over the teacher's 66.03% by +2.5 points across three seeds. Because M2M-DC exposes only a thin, architecture-aware interface (blocks, stages, and downsample/skip wiring), it generalizes across residual CNNs and extends to inverted-residual families with minor legalization rules. The result is a compact, practical recipe for deployment-ready models that match or surpass teacher accuracy at a fraction of the compute.
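The block-ranking step can be sketched with a simple plug-in MI estimator. The abstract does not specify how the label-aware MI signal is computed, so this sketch assumes each block's per-sample activations have already been quantized to discrete codes (for example, cluster IDs). The block names, the codes, and the helper names `mutual_information` and `rank_blocks_by_mi` are all hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of raw counts.
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi

def rank_blocks_by_mi(block_codes, labels):
    """Return (block names from least to most label-informative, per-block scores)."""
    scores = {name: mutual_information(codes, labels)
              for name, codes in block_codes.items()}
    return sorted(scores, key=scores.get), scores

# Toy data: six samples, three classes, three hypothetical blocks.
labels = [0, 0, 1, 1, 2, 2]
block_codes = {
    "layer1.0": [0, 0, 1, 1, 2, 2],  # codes determine the label exactly
    "layer2.1": [0, 1, 0, 1, 0, 1],  # codes independent of the label
    "layer3.0": [0, 0, 0, 1, 1, 1],  # codes partially informative
}
order, scores = rank_blocks_by_mi(block_codes, labels)
prune_candidates = order[:1]  # lowest-MI block is the first removal candidate
```

In this toy case `layer2.1` carries no label information and ranks first for removal. A real pipeline would also need to check that dropping a block keeps the residual shapes legal, which this sketch does not attempt.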

Paper Structure

This paper contains 58 sections, 7 equations, 7 tables, 1 algorithm.