The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information

Diyuan Wu; Ionut-Vlad Modoranu; Mher Safaryan; Denis Kuznedelev; Dan Alistarh

TL;DR

This work leverages curvature information in an OBS-like fashion in the projection step of classic iterative sparse recovery algorithms such as IHT, and shows for the first time that this leads both to improved convergence bounds under standard assumptions and to new sparse recovery algorithms inspired by the OBS framework.

Abstract

The rising footprint of machine learning has led to a focus on imposing \emph{model sparsity} as a means of reducing computational and memory costs. For deep neural networks (DNNs), the state-of-the-art accuracy-vs-sparsity trade-off is achieved by heuristics inspired by the classical Optimal Brain Surgeon (OBS) framework~\citep{lecun90brain, hassibi1992second, hassibi1993optimal}, which leverages loss curvature information to make better pruning decisions. Yet these results still lack a solid theoretical understanding, and it is unclear whether they can be improved by leveraging connections to the wealth of work on sparse recovery algorithms. In this paper, we draw new connections between these two areas and present new sparse recovery algorithms inspired by the OBS framework that come with theoretical guarantees under reasonable assumptions and have strong practical performance. Specifically, our work starts from the observation that we can leverage curvature information in an OBS-like fashion in the projection step of classic iterative sparse recovery algorithms such as IHT. We show for the first time that this leads to improved convergence bounds under standard assumptions. Furthermore, we present extensions of this approach to the practical task of obtaining accurate sparse DNNs, and validate it experimentally at scale for Transformer-based models on vision and language tasks.
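
For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of classic $k$-IHT (iterative hard thresholding) for sparse linear regression: a gradient step followed by a projection that keeps the $k$ largest-magnitude entries. It is the standard first-order algorithm, not the paper's I-OBS method, and it uses no curvature information. The function names (`hard_threshold`, `k_iht`), the least-squares objective, and the fixed step size are assumptions made for illustration.

```python
import numpy as np


def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    if k > 0:
        # Indices of the k entries with the largest absolute value.
        idx = np.argpartition(np.abs(v), -k)[-k:]
        out[idx] = v[idx]
    return out


def k_iht(A, y, k, step, n_iters=200):
    """Classic k-IHT sketch for min 0.5 * ||A theta - y||^2 s.t. ||theta||_0 <= k.

    Hypothetical helper for illustration only; this is not I-OBS.
    """
    theta = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ theta - y)                     # gradient of the least-squares loss
        theta = hard_threshold(theta - step * grad, k)   # gradient step, then top-k projection
    return theta
```

The projection here looks only at magnitudes after a plain gradient step; the abstract's point is that this projection step is where OBS-style curvature information can be brought in.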

Paper Structure (31 sections, 15 theorems, 72 equations, 3 figures, 5 tables, 2 algorithms)

Introduction
Related Work
Derivation and convergence of I-OBS
Notation
Problem Setup and Assumptions
IHT as a Proximal Point Method
I-OBS: Leveraging Second-order Information
Connection to WoodFisher/WoodTaylor~\citep{singh2020woodfisher}
Connection to OBC~\citep{frantar2022optimal}
Local convergence of I-OBS.
A Practical Variant: TopI-OBS
Experiments
Synthetic Experiments for Sparse Linear Regression
Applying I-OBS to Model Pruning
Detailed algorithm.
...and 16 more sections

Key Result

Lemma 1

The closed-form solution to (PPM-IHT) with linear model $\phi_t(\theta) = f(\theta_t) + \langle \nabla f(\theta_t), \theta-\theta_t \rangle$ is

Figures (3)

Figure 1: Comparison of $k$-IHT and Top$k$-WoodTaylor for sparse linear regression with standard Gaussian and MNIST priors.
Figure 2: I-OBS dynamics for Llama-2 7B (star corresponds to best validation score). (Left) Wikitext-2 Perplexity vs iteration. (Right) C4 Perplexity vs iteration.
Figure 3: I-OBS dynamics for Llama-3 8B (star corresponds to best validation score). (Left) Wikitext-2 Perplexity vs iteration. (Right) C4 Perplexity vs iteration.

Theorems & Definitions (24)

Lemma 1
Lemma 2
Theorem 1
Proof sketch
Lemma 3
Lemma 4
Lemma 4
Proof of Lemma \ref{lemma:kiht from ppm}
Lemma 4
Proof of Lemma \ref{lemma:WT-full update}
...and 14 more
