HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers

Andy Li, Aiden Durrant, Milan Markovic, Georgios Leontidis

Abstract

Vision Transformers require significant computational resources and memory bandwidth, severely limiting their deployment on edge devices. While recent structured pruning methods successfully reduce theoretical FLOPs, they typically operate at a single structural granularity and rely on complex, multi-stage pipelines with post-hoc thresholding to satisfy sparsity budgets. In this paper, we propose Hierarchical Auto-Pruning (HiAP), a continuous relaxation framework that discovers optimal sub-networks in a single end-to-end training phase without requiring manual importance heuristics or predefined per-layer sparsity targets. HiAP introduces stochastic Gumbel-Sigmoid gates at multiple granularities: macro-gates to prune entire attention heads and FFN blocks, and micro-gates to selectively prune intra-head dimensions and FFN neurons. By optimizing both levels simultaneously, HiAP addresses both the memory-bound overhead of loading large matrices and the compute-bound mathematical operations. HiAP naturally converges to stable sub-networks using a loss function that incorporates both structural feasibility penalties and analytical FLOPs. Extensive experiments on ImageNet demonstrate that HiAP organically discovers highly efficient architectures, and achieves a competitive accuracy-efficiency Pareto frontier for models like DeiT-Small, matching the performance of sophisticated multi-stage methods while significantly simplifying the deployment pipeline.
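
The two gate levels described in the abstract can be illustrated with a minimal sketch. It assumes the common binary-concrete form of the Gumbel-Sigmoid relaxation (logistic noise added to a learnable logit, divided by a temperature $\tau$) and assumes that a macro-gate simply multiplies the micro-gates beneath it; the paper's exact parameterization is not given in this summary. The names `gumbel_sigmoid`, `sample_head_masks`, and `read_architecture` are hypothetical, and reading the final architecture from the sign of each logit is an illustrative choice, not the authors' stated procedure.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gumbel_sigmoid(logit, tau, rng):
    """Draw one relaxed Bernoulli (binary-concrete) sample in [0, 1]."""
    u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
    # Logistic noise, equal in distribution to the difference of two Gumbel samples.
    noise = math.log(u) - math.log(1.0 - u)
    return sigmoid((logit + noise) / tau)


def sample_head_masks(head_logits, dim_logits, tau, rng):
    """Soft mask per (head, dimension): macro-gate times micro-gate (assumed composition)."""
    masks = []
    for h, head_logit in enumerate(head_logits):
        macro = gumbel_sigmoid(head_logit, tau, rng)  # one gate for the whole head
        row = [macro * gumbel_sigmoid(d, tau, rng) for d in dim_logits[h]]
        masks.append(row)
    return masks


def read_architecture(head_logits, dim_logits):
    """Illustrative discrete read-out: keep a dimension only if both logits are positive."""
    return [
        [int(head_logit > 0 and d > 0) for d in dim_logits[h]]
        for h, head_logit in enumerate(head_logits)
    ]


rng = random.Random(0)
head_logits = [3.0, -3.0]                         # head 0 favoured, head 1 disfavoured
dim_logits = [[2.0, -2.0, 2.0], [2.0, 2.0, 2.0]]  # per-head dimension logits
soft = sample_head_masks(head_logits, dim_logits, tau=0.5, rng=rng)
hard = read_architecture(head_logits, dim_logits)
print(hard)  # [[1, 0, 1], [0, 0, 0]]
```

In this toy setting, head 1 is removed entirely even though all of its dimension logits are positive, while head 0 survives with one dimension pruned. This shows how a coarse gate and a fine gate can act on the same weights at once.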

Paper Structure (18 sections, 3 theorems, 7 equations, 7 figures, 3 tables)

Key Result

Lemma 1

Let $\mathcal{A}_{\text{head}}$ be the set of architectures reachable by head/block (macro) gates only, and let $\mathcal{A}_{\text{hiap}}$ be those additionally using micro-gates over attention dimensions or FFN neurons. If any layer has $D_h > 1$ or $D_{\text{ffn}} > 1$, then $\mathcal{A}_{\text{h

Figures (7)

  • Figure 1: Overview of the Hierarchical Auto-Pruning (HiAP) framework applied to a standard Vision Transformer block. The architecture's topology is governed by learnable Gumbel-Sigmoid gates operating at two distinct granularities. Macro-gates (Block and Head logits) evaluate whether to retain or bypass entire MLP modules and attention heads. Concurrently, micro-gates (Neuron and Dimension logits) prune fine-grained structures within the surviving active structures. This dual-level formulation allows the network to autonomously carve out an optimal, hardware-efficient sub-network during a single end-to-end training phase.
  • Figure 2: Gumbel-Sigmoid temperature annealing over the course of single-phase training. During the early epochs (e.g., $\tau=2.0$), the gate distribution resembles a Gaussian and acts as a soft, continuous regularizer. As training progresses and $\tau$ decays, the probability density sharply bifurcates toward 0 and 1, naturally hardening the network into a discrete sub-architecture without inducing gradient shock.
  • Figure 3: The architecture topology at the end of training
  • Figure 4: Top-1 accuracy vs. GFLOPs, evaluated at an early training stage across different penalty configurations
  • Figure 5: $\lambda_{macro}=0.9, \lambda_{micro}=0.45$
  • ...and 2 more figures
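
The hardening effect described in the Figure 2 caption can be checked numerically. The sketch below again assumes the binary-concrete form of the Gumbel-Sigmoid gate and a neutral logit of 0. It measures how often a sampled gate lands in the "undecided" range (0.1, 0.9) at several temperatures. The helper `undecided_fraction`, the range bounds, and the temperature grid are hypothetical choices for illustration; the paper's actual annealing schedule is not given in this summary.

```python
import math
import random


def gumbel_sigmoid(logit, tau, rng):
    """Relaxed Bernoulli sample: sigmoid((logit + logistic noise) / tau)."""
    u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
    noise = math.log(u) - math.log(1.0 - u)
    z = (logit + noise) / tau
    # Numerically stable sigmoid.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def undecided_fraction(logit, tau, n_samples=20000, seed=0):
    """Fraction of sampled gate values that fall strictly between 0.1 and 0.9."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(n_samples)
        if 0.1 < gumbel_sigmoid(logit, tau, rng) < 0.9
    )
    return hits / n_samples


# Illustrative temperature grid, from the early value quoted in the caption downward.
fractions = {tau: undecided_fraction(0.0, tau) for tau in (2.0, 1.0, 0.5, 0.1)}
for tau, frac in fractions.items():
    print(f"tau={tau}: undecided fraction = {frac:.3f}")
```

For a zero logit the fraction has the closed form $2\sigma(\tau \ln 9) - 1$, so the estimates should be close to 0.98 at $\tau = 2.0$ and 0.11 at $\tau = 0.1$. At high temperature most samples sit near the middle, and at low temperature most sit near 0 or 1.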

Theorems & Definitions (6)

  • Lemma 1: Expressivity: strict superset
  • Proof: Proof sketch
  • Proposition 1: Budget linearity
  • Proof: Proof sketch
  • Proposition 2: Soft-to-hard budget alignment
  • Proof: Proof sketch