Majority Kernels: An Approach to Leverage Big Model Dynamics for Efficient Small Model Training

Hanna Mazzawi, Pranjal Awasthi, Xavi Gonzalvo, Srikumar Ramalingam

TL;DR

This paper explores the hypothesis that a single training run can simultaneously train a larger model for performance and derive a smaller model for deployment. It demonstrates that the proposed technique can modify the training dynamics, resulting in performance gains across architectures and tasks while keeping inference performance consistent.

Abstract

Recent breakthroughs and successful deployments of large language and vision models in constrained environments predominantly follow a two-phase approach. First, a large model is trained to achieve peak performance; then a model-shrinking method is applied to meet hardware constraints. Methods such as distillation, compression, or quantization leverage the highly performant large model to induce a smaller performant one. Formally, this can be seen as the problem of identifying an optimal model of size $n$ from a larger model of size $k \cdot n$, where $k > 1$ is the overparameterization factor. This paper explores the hypothesis that a single training run can simultaneously train a larger model for performance and derive a smaller model for deployment. Our contribution is an effective architectural change, namely Majority Kernels, that is compatible with the main standard architectures such as multi-layer perceptrons (MLPs), residual networks (ResNets), and Transformers. We demonstrate that applying our technique can modify the training dynamics, resulting in performance gains across architectures and tasks while keeping inference performance consistent. Furthermore, our approach adds minimal overhead to the training cost (wall-clock time). The proposed approach shows strong performance on a wide variety of datasets and models, even outperforming strong baselines such as distilled ensembles and combinatorial optimization methods based on submodular optimization.
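To make the "size $n$ from size $k \cdot n$" framing concrete, the following is a minimal sketch and not the authors' implementation. It assumes one plausible reading of the idea: a layer keeps $k$ same-shaped kernels during training, mixes them with random non-negative weights that sum to 1 at each step, and deploys their uniform average. The sampling scheme and all function names (`random_simplex_weights`, `combine_kernels`, `training_kernel`, `inference_kernel`) are hypothetical and chosen only for illustration.

```python
import random


def random_simplex_weights(k, rng):
    """Draw k positive weights that sum to 1 (assumed sampling scheme)."""
    raw = [rng.random() + 1e-12 for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


def combine_kernels(kernels, weights):
    """Weighted element-wise sum of k same-shaped kernels (lists of rows)."""
    rows, cols = len(kernels[0]), len(kernels[0][0])
    return [
        [sum(w * kern[i][j] for w, kern in zip(weights, kernels))
         for j in range(cols)]
        for i in range(rows)
    ]


def training_kernel(kernels, rng):
    """Kernel used for one training step: a random mixture of the k kernels."""
    weights = random_simplex_weights(len(kernels), rng)
    return combine_kernels(kernels, weights)


def inference_kernel(kernels):
    """Kernel used for deployment: the uniform average of the k kernels."""
    k = len(kernels)
    return combine_kernels(kernels, [1.0 / k] * k)


def linear(x, kernel):
    """Compute y = x @ kernel for a single input vector x."""
    cols = len(kernel[0])
    return [sum(x[i] * kernel[i][j] for i in range(len(x)))
            for j in range(cols)]


# Toy layer with k = 2 kernels of shape 2x2: 8 trained weights, 4 deployed.
rng = random.Random(0)
kernels = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 4.0], [5.0, 6.0]],
]
step_kernel = training_kernel(kernels, rng)   # changes from step to step
deployed = inference_kernel(kernels)          # fixed, same shape as one kernel
output = linear([1.0, 1.0], deployed)
```

Under these assumptions the deployed layer has the shape and cost of a single kernel, while training touches all $k$ of them; how gradients flow back through the mixture is left out of this sketch.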

Paper Structure (14 sections, 5 theorems, 29 equations, 10 figures, 5 tables, 4 algorithms)

This paper contains 14 sections, 5 theorems, 29 equations, 10 figures, 5 tables, 4 algorithms.

Key Result

Theorem 3.1

Let $L$ be a sufficiently differentiable function on the parameter space $\theta \in \mathbb{R}^n$. The modified loss when using Majority Kernels is [equation missing in source], where $\epsilon \in \mathbb{R}^n$ is the random perturbation of the virtual parameters.

Figures (10)

  • Figure 1.1: The majority kernels for the $i$-th layer.
  • Figure 4.2: Test curves for Baseline-Long (blue), Majority Kernels (orange), Adv-Majority Kernels (green), distilled-Baseline (red) and between (purple). It is easy to see that the generalization gap (accuracy gap between train and test above) is better in our optimization throughout training with better final performance.
  • Figure 4.3: Results on ImageNet for the various algorithms.
  • Figure D.4: Train and eval curves for the algorithms labeled adv_only and adv_mj
  • Figure E.5: Eval curves for our algorithm compared to vanilla training on GLUE CoLA
  • ...and 5 more figures

Theorems & Definitions (7)

  • Theorem 3.1: Backward Error Analysis
  • Theorem A.1: Backward Error Analysis
  • proof
  • Corollary A.2
  • Lemma B.0: PAC-Bayesian Bound with Stochastic Weights
  • Lemma B.0: Reduced Learning Rate with uniform probabilities
  • proof