HESSO: Towards Automatic Efficient and User Friendly Any Neural Network Training and Pruning

Tianyi Chen; Xiaoyi Qu; David Aponte; Colby Banbury; Jongwoo Ko; Tianyu Ding; Yong Ma; Vladimir Lyapunov; Ilya Zharkov; Luming Liang

TL;DR

The paper addresses the high cost of deploying large DNNs by introducing HESSO, a Hybrid Efficient Structured Sparse Optimizer, and its CRIC extension for reliable pruning. HESSO combines progressive pruning, flexible saliency scoring, and a hybrid training scheme to automatically produce high-performing sub-networks with minimal hyperparameter tuning, while CRIC mitigates approximation errors that can lead to irreversible performance loss. The approach is architecture-agnostic and validated across vision, detection, NLP, and large language models, often achieving state-of-the-art or competitive results with reduced tuning overhead. Together, HESSO and CRIC offer a practical, scalable solution for automatic training and pruning that enables efficient deployment of compact DNNs in resource-constrained environments.

Abstract

Structured pruning is one of the most popular approaches for compressing heavy deep neural networks (DNNs) into compact sub-networks while retaining performance. Existing methods suffer from multi-stage procedures that demand significant engineering effort and human expertise. The Only-Train-Once (OTO) series was recently proposed to resolve many of these pain points by streamlining the workflow, automatically conducting (i) search space generation, (ii) structured sparse optimization, and (iii) sub-network construction. However, the built-in sparse optimizers in the OTO series, i.e., the Half-Space Projected Gradient (HSPG) family, require hyper-parameter tuning and offer only implicit control over sparsity exploration, and consequently require intervention by human experts. To address these limitations, we propose the Hybrid Efficient Structured Sparse Optimizer (HESSO). HESSO automatically and efficiently trains a DNN to produce a high-performing sub-network. It is also almost tuning-free and integrates easily into generic training applications. To address another common issue, the irreversible performance collapse observed when pruning DNNs, we further propose a Corrective Redundant Identification Cycle (CRIC) for reliably identifying indispensable structures. We numerically demonstrate the efficacy of HESSO and its enhanced version HESSO-CRIC on a variety of applications ranging from computer vision to natural language processing, including large language models. The numerical results show that HESSO achieves performance competitive with, or even superior to, various state-of-the-art methods and supports most DNN architectures. Meanwhile, CRIC effectively prevents irreversible performance collapse and further enhances the performance of HESSO on certain applications.

Paper Structure (29 sections, 3 theorems, 9 equations, 6 figures, 10 tables, 3 algorithms)

Introduction
Related Works
General Pruning Procedures.
Automated Pruning Given Pre-defined Search Space.
Automated Pruning Over Any DNNs.
Knowledge Transfer.
Neural Architecture Optimization.
HESSO
Saliency Score
Hybrid Training in HESSO
Minimize tuning efforts compared to DHSPG.
Architecture-agnostic computational invariance compared to ResRep and SliceGPT.
Approximation Errors of Saliency Scores
Saliency score approximation errors.
Corrective Redundancy Identification Circle
...and 14 more sections

Key Result

Theorem 3.2

Suppose the gradient and second-order derivative of $f$ are bounded. Use first-order $m^L$ and second-order $m^Q$ Taylor approximations to measure the function value $f$ after pruning $g\in\mathcal{G}$, i.e., $[\bm{x}]_g\to \bm{0}$. Let $\bm{s}$ satisfy $[\bm{s}]_{\mathcal{G}/g}=[\bm{0}]_{\mathcal{G
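For orientation, the first- and second-order Taylor models named in the theorem conventionally take the form below. This is a sketch using textbook definitions, not the paper's exact statement, which is truncated above. The perturbation symbol $\bm{s}$ follows the theorem; the constant $M$ (a bound on the norm of the second-order derivative) is a hypothetical name introduced here for illustration.

$$
m^L(\bm{x}+\bm{s}) = f(\bm{x}) + \nabla f(\bm{x})^\top \bm{s},
\qquad
m^Q(\bm{x}+\bm{s}) = f(\bm{x}) + \nabla f(\bm{x})^\top \bm{s} + \tfrac{1}{2}\,\bm{s}^\top \nabla^2 f(\bm{x})\,\bm{s}.
$$

Pruning a group $g$ corresponds to a perturbation for which $[\bm{x}+\bm{s}]_g = \bm{0}$. Under the assumed bound $\|\nabla^2 f\| \le M$, Taylor's theorem gives

$$
\left| f(\bm{x}+\bm{s}) - m^L(\bm{x}+\bm{s}) \right| \le \tfrac{M}{2}\,\|\bm{s}\|^2 ,
$$

so the first-order estimate of the post-pruning function value can be off by an amount that grows with the magnitude of the pruned variables.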

Figures (6)

Figure 1: Automatic joint training and structured pruning of any DNN, achieved by the pruning mode of OTO together with the proposed HESSO optimizer and its enhanced HESSO-CRIC variant. The procedure can be applied to various DNNs and applications, and integrated seamlessly into any training pipeline to directly produce a compact pruned sub-network without further fine-tuning.
Figure 2: Automated trainable variable partitions for one-shot structured pruning. Given the trace graph of a DNN, automatic pruning frameworks such as OTOv2 [chen2023otov2] construct a pruning dependency graph and partition the trainable variables into pruning zero-invariant groups $\mathcal{G}$.
Figure 3: HESSO uses saliency scores to periodically identify redundant groups $\mathcal{G}_R$ from the group set $\mathcal{G}$ and marks the remaining groups as important groups $\mathcal{G}_I$. Knowledge transfer is carried out by applying hybrid training strategies to $\mathcal{G}_R$ and $\mathcal{G}_I$. In particular, the variables in $\mathcal{G}_R$ are progressively projected onto zero after gradient descent. The important variables continue training via gradient descent to mitigate the impact of the redundant projection on the objective function.
Figure 4: ResNet50 on ImageNet.
Figure 5: Visual examples of pruned YOLOv5l.
...and 1 more figure
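
The hybrid training step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`magnitude_saliency`, `pick_redundant`, `hybrid_step`), the use of the group L2 norm as the saliency score, and the linear shrink schedule that drives redundant groups to exactly zero by the last step of a pruning period are all hypothetical choices made for the example.

```python
import numpy as np

def magnitude_saliency(x, groups):
    """Score each group by the L2 norm of its variables (assumed saliency choice)."""
    return {g: float(np.linalg.norm(x[idx])) for g, idx in groups.items()}

def pick_redundant(scores, num_to_prune):
    """Return the ids of the lowest-scoring groups (the redundant set G_R)."""
    ranked = sorted(scores, key=scores.get)
    return set(ranked[:num_to_prune])

def hybrid_step(x, grad, groups, redundant, lr, steps_left):
    """One hybrid update.

    All variables take a plain gradient-descent step. Variables in redundant
    groups are then shrunk toward zero; with steps_left == 1 the scale factor
    is 0, so they land exactly on zero (assumed linear schedule).
    """
    x = x - lr * grad                      # gradient descent on every group
    keep = (steps_left - 1) / steps_left   # fraction retained after projection
    for g in redundant:
        x[groups[g]] *= keep               # progressive projection toward zero
    return x

# Toy problem: f(x) = 0.5 * ||x - target||^2, three groups of two variables.
target = np.array([3.0, 3.0, 0.1, 0.1, 2.0, 2.0])
groups = {0: np.array([0, 1]), 1: np.array([2, 3]), 2: np.array([4, 5])}
x = np.array([2.5, 3.5, 0.1, -0.1, 1.5, 2.0])

redundant = pick_redundant(magnitude_saliency(x, groups), num_to_prune=1)
period = 5  # number of steps over which the redundant group is driven to zero
for steps_left in range(period, 0, -1):
    grad = x - target
    x = hybrid_step(x, grad, groups, redundant, lr=0.5, steps_left=steps_left)
```

In this toy run the low-norm group is selected as redundant and ends the period at zero, while the two important groups keep moving toward the minimizer of the objective. A real pipeline would replace the quadratic objective with the network loss and repeat the identify-then-project cycle periodically until the target sparsity is reached.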

Theorems & Definitions (5)

Definition 3.1: Indispensable structure
Theorem 3.2: Approximation error of Taylor importance
Theorem 3.3: Finite termination of CRIC
Corollary 3.4: Upper bounds of cycle numbers
proof
