MAST: Model-Agnostic Sparsified Training

Yury Demidovich, Grigory Malinovsky, Egor Shulgin, Peter Richtárik

TL;DR

MAST reframes training as sparsified optimization around a pre-trained center $v$ using random sketches $\mathbf{S}$, formulating $\min_x f_{\mathcal{D}}(x)=\mathbb{E}[f_{\mathbf{S}}(x)]$ with $f_{\mathbf{S}}(x)=f(v+\mathbf{S}(x-v))$. The gradient estimator $\nabla f_{\mathbf{S}}(x)=\mathbf{S}^{\top}\nabla f(v+\mathbf{S}(x-v))$ is unbiased when $\mathbb{E}[\mathbf{S}]=\mathbf{I}$, which enables SGD and variance-reduced (VR) algorithms with convergence guarantees in the convex, strongly convex, and nonconvex regimes; the analysis ties performance to sketch properties through the constants $L_{\mathbf{S}},\mu_{\mathbf{S}},L_{\mathcal{D}},\mu_{\mathcal{D}}$. The framework subsumes Dropout and sparse training and extends to distributed, Independent Subnetwork Training (IST), and federated learning (FL) settings, for which the paper derives explicit rates and characterizes interpolation behavior under various assumptions. Experiments on logistic regression and deep networks show that MAST solutions are more robust to pruning and that the theory offers guidance for learning-rate tuning under sparsity.
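To make the formulas in the summary above concrete, here is a minimal NumPy sketch of the sketched loss $f_{\mathbf{S}}$ and its gradient $\mathbf{S}^{\top}\nabla f(v+\mathbf{S}(x-v))$. It assumes a diagonal Bernoulli sketch (each diagonal entry is $1/p$ with probability $p$ and $0$ otherwise, so that $\mathbb{E}[\mathbf{S}]=\mathbf{I}$) and a toy least-squares objective; the function names `bernoulli_sketch`, `mast_loss`, and `mast_grad` and the random data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def bernoulli_sketch(d, p, rng):
    # Diagonal of S = diag(b_1/p, ..., b_d/p) with b_i ~ Bernoulli(p), so E[S] = I.
    return (rng.random(d) < p).astype(float) / p

def mast_loss(f, x, v, s):
    # f_S(x) = f(v + S (x - v)); S is diagonal and stored as the vector s.
    return f(v + s * (x - v))

def mast_grad(grad_f, x, v, s):
    # grad f_S(x) = S^T grad f(v + S (x - v)); S^T = S for a diagonal sketch.
    return s * grad_f(v + s * (x - v))

# Toy objective (assumption): f(w) = 0.5 * ||A w - b||^2 on random data.
rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((20, d))
b = rng.standard_normal(20)
f = lambda w: 0.5 * np.sum((A @ w - b) ** 2)
grad_f = lambda w: A.T @ (A @ w - b)

x = rng.standard_normal(d)   # current iterate
v = rng.standard_normal(d)   # stand-in for the pre-trained center
s = bernoulli_sketch(d, p=0.5, rng=rng)
g = mast_grad(grad_f, x, v, s)

# Coordinates dropped by the sketch receive a zero gradient entry.
print("sketch diagonal:    ", s)
print("sparsified gradient:", g)
```

Because the same sketch multiplies both the model shift $x-v$ and the gradient, the model evaluated inside $f$ and the resulting update are both sparse in this example, which is the sense in which the formulation sparsifies the model and the gradient together.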

Abstract

We introduce a novel optimization problem formulation that departs from the conventional way of minimizing machine learning model loss as a black-box function. Unlike traditional formulations, the proposed approach explicitly incorporates an initially pre-trained model and random sketch operators, allowing for sparsification of both the model and gradient during training. We establish insightful properties of the proposed objective function and highlight its connections to the standard formulation. Furthermore, we present several variants of the Stochastic Gradient Descent (SGD) method adapted to the new problem formulation, including SGD with general sampling, a distributed version, and SGD with variance reduction techniques. We achieve tighter convergence rates and relax assumptions, bridging the gap between theoretical principles and practical applications and covering several important techniques such as Dropout and sparse training. This work presents promising opportunities to enhance the theoretical understanding of model training through a sparsification-aware optimization approach.
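As a rough illustration of what SGD adapted to this formulation can look like, the sketch below draws a fresh sketch at every step and applies the update $x \leftarrow x - \gamma\,\mathbf{S}^{\top}\nabla f(v+\mathbf{S}(x-v))$. It is a minimal sketch under stated assumptions, not the paper's algorithm listing: the diagonal Bernoulli sketch, the toy least-squares data, the helper name `sketched_sgd`, and the conservative step size $\gamma = p^2/L_f$ are all choices made here for illustration.

```python
import numpy as np

def sketched_sgd(grad_f, x0, v, p, step, n_steps, seed=0):
    """Iterate x <- x - step * S^T grad_f(v + S (x - v)) with a fresh sketch per step.

    S is a diagonal Bernoulli sketch (illustrative assumption): each diagonal
    entry equals 1/p with probability p and 0 otherwise, so E[S] = I.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(n_steps):
        s = (rng.random(x.size) < p) / p   # diagonal of S for this step
        y = v + s * (x - v)                # sketched model: v plus sparsified shift
        x = x - step * s * grad_f(y)       # sparsified gradient step
    return x

# Toy least-squares problem (hypothetical data, for illustration only).
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 6))
b = rng.standard_normal(30)
f = lambda w: 0.5 * np.sum((A @ w - b) ** 2)
grad_f = lambda w: A.T @ (A @ w - b)

L_f = np.linalg.eigvalsh(A.T @ A).max()    # smoothness constant of f
x0 = np.zeros(6)
v = np.zeros(6)                            # stand-in for a pre-trained model
p = 0.5
x_out = sketched_sgd(grad_f, x0, v, p, step=p**2 / L_f, n_steps=500)
print("f(x0) =", f(x0), " f(x_out) =", f(x_out))
```

Setting `p=1.0` makes every sketch the identity, so the loop reduces to plain gradient descent on $f$; this is a convenient sanity check before experimenting with sparser sketches.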

Paper Structure

This paper contains 42 sections, 51 theorems, 216 equations, 8 figures, 4 algorithms.

Key Result

Lemma 1

If $f$ is $L_f$-smooth, then

Figures (8)

  • Figure 1: Distributions of test accuracy for sparsified solutions of the ERM formulation \ref{eq:main} and the MAST problem \ref{eq:pretrained_compressed_problem}. "Sparsity" is the percentage of zeroed weights.
  • Figure 2: Performance of Algorithm \ref{alg:distributed_GD} with Bernoulli sketches \ref{eq:ind_sparse} on the standard loss \ref{eq:fin_sketch_sum} (for ${\bf S}_i \equiv \mathbf{I}$).
  • Figure 3: Accuracy distributions of sparsified solutions for the ERM \ref{eq:main} and MAST \ref{eq:pretrained_compressed_problem} formulations.
  • Figure 4: Test accuracies of sparsified solutions for the ERM formulation \ref{eq:main} and the MAST problem \ref{eq:pretrained_compressed_problem}.
  • Figure 5: Convergence of the finite-sum MAST loss \ref{eq:sketch_fin_sum} for Algorithm \ref{alg:SGD} (II) with subsampling.
  • ...and 3 more figures

Theorems & Definitions (91)

  • Example 1
  • Example 2
  • Lemma 1: Consequences of $L_f$-smoothness
  • Lemma 2: Consequence of Convexity
  • Lemma 3: Consequences of $\mu_f$-convexity
  • Theorem 1
  • Lemma 4
  • Theorem 2
  • Theorem 2
  • Theorem 3
  • ...and 81 more