TwIST: Rigging the Lottery in Transformers with Independent Subnetwork Training

Michael Menezes, Barbara Su, Xinze Feng, Yehya Farhat, Hamza Shili, Anastasios Kyrillidis

TL;DR

This work tackles the high cost of training and deploying large language models by reframing sparsification as a training-time problem. The authors introduce TwIST, a distributed framework that trains multiple independent subnets in parallel, periodically aggregates them into a central model, and resamples new subnets during training, enabling zero-cost pruning at deployment. Central contributions include the subnet generation/dispatch/aggregation pipeline, the activation-shift correction mechanism, and the empirical validation of the golden lottery ticket hypothesis, showing that randomly sampled subnets can achieve high performance without fine-tuning, especially under aggressive sparsity ($\kappa \le 0.5$). TwIST also demonstrates tangible training-efficiency gains and yields dense, hardware-friendly subnetworks that deliver speedups on commodity hardware, making sparse LLMs more practical for broad deployment. Overall, this approach shifts the sparsification burden from post-training recovery to the training process itself, offering a scalable path to deployable sparse Transformer models with competitive perplexity.

Abstract

We introduce TwIST, a distributed training framework for efficient large language model (LLM) sparsification. TwIST trains multiple subnetworks in parallel, periodically aggregates their parameters, and resamples new subnetworks during training. This process identifies high-quality subnetworks ("golden tickets") without requiring post-training procedures such as calibration or Hessian-based recovery. As a result, TwIST enables zero-cost pruning at deployment time while achieving perplexity competitive with state-of-the-art post-training sparsification methods. The benefits are most pronounced under aggressive sparsity (e.g., 50%+), where TwIST significantly outperforms baseline methods; for example, reaching 23.14 PPL compared to 31.64 for the closest prior approach. Unlike unstructured pruning, TwIST produces structured, dense matrices that offer practical inference speedups and memory reductions on commodity hardware (e.g., CPUs) that do not support efficient sparse computation. TwIST provides an efficient training-time path to deployable sparse LLMs without additional fine-tuning or recovery overhead.
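
The sample, train, aggregate, and resample cycle described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: each "block" is reduced to a single scalar parameter, and the function names (`sample_subnets`, `local_train`, `aggregate`, `twist_like_loop`), the uniform random sampling rule, the quadratic stand-in loss, and the choice to average each block over only the workers that trained it are all assumptions made for illustration.

```python
import random


def sample_subnets(num_blocks, num_workers, keep_ratio, rng):
    """Draw one random subset of block indices per worker (assumed sampling rule)."""
    k = max(1, round(keep_ratio * num_blocks))
    return [sorted(rng.sample(range(num_blocks), k)) for _ in range(num_workers)]


def local_train(params, target, steps, lr):
    """Stand-in for worker training: gradient steps on 0.5 * (p - target)^2 per block."""
    out = dict(params)
    for _ in range(steps):
        for b in out:
            out[b] -= lr * (out[b] - target[b])
    return out


def aggregate(central, worker_params):
    """Average each block over the workers that trained it; untouched blocks stay as-is."""
    new = list(central)
    for b in range(len(central)):
        vals = [wp[b] for wp in worker_params if b in wp]
        if vals:
            new[b] = sum(vals) / len(vals)
    return new


def twist_like_loop(num_blocks=8, num_workers=4, keep_ratio=0.5, rounds=20, seed=0):
    """Repeat: resample subnets, train them independently, aggregate into the central model."""
    rng = random.Random(seed)
    central = [0.0] * num_blocks
    target = [float(b + 1) for b in range(num_blocks)]  # hypothetical optimum
    for _ in range(rounds):
        subnets = sample_subnets(num_blocks, num_workers, keep_ratio, rng)
        trained = [
            local_train({b: central[b] for b in s}, target, steps=5, lr=0.2)
            for s in subnets
        ]
        central = aggregate(central, trained)
    return central, target
```

Calling `twist_like_loop()` returns the central parameters and the toy target. Because every worker only ever holds `keep_ratio` of the blocks, any sampled subset can be extracted from the central model at the end without a separate pruning step, which is the property the abstract refers to as zero-cost pruning.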

Paper Structure

This paper contains 22 sections, 10 theorems, 42 equations, 10 figures, 2 tables, 1 algorithm.

Key Result

Lemma 1.1

Let $\bm{x} \in \mathbb{R}^{d_{\text{in}}}$ be a random vector with i.i.d. components. Let $\bm{W} \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ be a random matrix with i.i.d. components, $\bm{W}_{ji} \sim \mathcal{N}(0, \sigma_W^2)$. If we let $\bm{y} := \bm{W} \bm{x}$, then the expected sq
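
The lemma statement is cut off in this extract, so its exact conclusion is not reproduced here. As a hedged numerical companion to the setup only, the sketch below estimates $\mathbb{E}\|\bm{W}\bm{x}\|^2$ by Monte Carlo and compares it with the standard identity $\mathbb{E}\|\bm{W}\bm{x}\|^2 = d_{\text{out}}\,\sigma_W^2\,\mathbb{E}\|\bm{x}\|^2$, which holds when $\bm{W}$ is zero-mean and independent of $\bm{x}$. The function names and the choice of standard normal inputs are assumptions, and the identity may or may not be the form the lemma takes.

```python
import random


def mean_sq_norm(d_in, d_out, sigma_w, trials, rng):
    """Monte Carlo estimate of E[||W x||^2] with W_ji ~ N(0, sigma_w^2)."""
    total = 0.0
    for _ in range(trials):
        # Assumed input distribution: i.i.d. standard normal components.
        x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]
        for _ in range(d_out):
            # One output coordinate y_j = sum_i W_ji * x_i with a fresh row of W.
            y_j = sum(rng.gauss(0.0, sigma_w) * x_i for x_i in x)
            total += y_j * y_j
    return total / trials


def predicted_sq_norm(d_in, d_out, sigma_w, second_moment=1.0):
    """d_out * sigma_w^2 * E[||x||^2], using E[||x||^2] = d_in * E[x_i^2]."""
    return d_out * sigma_w ** 2 * d_in * second_moment
```

Under this identity the expected squared norm scales linearly with $d_{\text{in}}$, so dropping half of the input coordinates halves it: `predicted_sq_norm(16, 8, 0.5)` is 32.0 while `predicted_sq_norm(8, 8, 0.5)` is 16.0.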

Figures (10)

  • Figure 1: TwIST system overview. (1) From a central model, (2) a subnet generator* creates diverse subnets. (3) A dispatcher sends these subnets via Peripheral Component Interconnect Express (PCIe) to (4) multiple workers for parallel training on distinct data shards. (5) An aggregator updates the central model by averaging the parameters from the trained subnets using the shown formula. (6) The final model is then deployed to LLM (Large Language Model) clients for inference. (*The generator supports different heuristics for training vs. deployment.)
  • Figure 2: TwIST's asymptotic impact on memory for GPT-2 model variants. Subnets have half the blocks of the full model.
  • Figure 3: Visualization of the three TwIST variants and their training dynamics in the case of $S = 4$, where Worker $1$ is the central accelerator. In Masked TwIST, subnets are simulated by masking activations within a single shared model. In True TwIST, each worker trains a physically smaller subnet that is scattered from and later synchronized with the central model. In Hybrid TwIST, the central model participates in training as a masked subnet while the remaining workers train physical subnets. Within each repartition interval, the yellow regions indicate active parameters being updated during training, while the grey regions denote inactive parameters that are frozen or masked out.
  • Figure 4: Distribution of eval loss for randomly generated subnets in the attn configuration. The distributions for TwIST ($SE_{6/12}$) are compared against a DDP baseline across various subnet ratios. The $SE_{6/12}$ variant of TwIST is presented for a direct comparison, as both this method and DDP involve only a single training pass.
  • Figure 5: Heatmap of subnet robustness for the attn setting. Brighter colors (yellow) signify lower PPL (better performance), and darker colors (blue) signify higher PPL.
  • ...and 5 more figures
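
As a rough companion to the Figure 2 caption, the following back-of-the-envelope sketch counts parameters for the standard GPT-2 configurations when only half of the blocks are kept. It is not the paper's memory model. Reading "blocks" as whole Transformer layers is an assumption, as are the helper names, and the count covers weights only (no activations, gradients, or optimizer state).

```python
# Standard GPT-2 configurations: name -> (number of blocks, hidden size).
GPT2_CONFIGS = {
    "gpt2": (12, 768),
    "gpt2-medium": (24, 1024),
    "gpt2-large": (36, 1280),
    "gpt2-xl": (48, 1600),
}
VOCAB_SIZE = 50257
CONTEXT_LEN = 1024


def block_params(d):
    """Parameters in one GPT-2 block with hidden size d."""
    attn = (3 * d * d + 3 * d) + (d * d + d)      # QKV projection + output projection
    mlp = (4 * d * d + 4 * d) + (4 * d * d + d)   # up projection + down projection
    norms = 2 * 2 * d                             # two LayerNorms (scale and bias)
    return attn + mlp + norms


def model_params(n_blocks, d):
    """Total parameters, with the LM head tied to the token embedding."""
    embeddings = VOCAB_SIZE * d + CONTEXT_LEN * d
    final_norm = 2 * d
    return embeddings + n_blocks * block_params(d) + final_norm


def half_block_ratio(name):
    """Fraction of parameters left when half of the blocks are kept."""
    n_blocks, d = GPT2_CONFIGS[name]
    return model_params(n_blocks // 2, d) / model_params(n_blocks, d)
```

Because the embeddings are not halved, the ratio sits above 0.5 and moves toward it as the model grows: roughly 0.66 for `gpt2` and roughly 0.53 for `gpt2-xl` under these assumptions.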

Theorems & Definitions (20)

  • Lemma 1.1
  • proof
  • Lemma 1.2
  • proof
  • Theorem 1.3
  • proof
  • Lemma 1.4
  • proof
  • Lemma 1.5
  • proof
  • ...and 10 more