TwIST: Rigging the Lottery in Transformers with Independent Subnetwork Training
Michael Menezes, Barbara Su, Xinze Feng, Yehya Farhat, Hamza Shili, Anastasios Kyrillidis
TL;DR
This work tackles the high cost of training and deploying large language models by reframing sparsification as a training-time problem. The authors introduce TwIST, a distributed framework that trains multiple independent subnets in parallel, periodically aggregates them into a central model, and resamples new subnets during training, enabling zero-cost pruning at deployment. Central contributions include the subnet generation/dispatch/aggregation pipeline, the activation-shift correction mechanism, and the empirical validation of the golden lottery ticket hypothesis, showing that randomly sampled subnets can achieve high performance without fine-tuning, especially under aggressive sparsity ($\kappa \le 0.5$). TwIST also demonstrates tangible training-efficiency gains and yields dense, hardware-friendly subnetworks that deliver speedups on commodity hardware, making sparse LLMs more practical for broad deployment. Overall, this approach shifts the sparsification burden from post-training recovery to the training process itself, offering a scalable path to deployable sparse Transformer models with competitive perplexity.
Abstract
We introduce TwIST, a distributed training framework for efficient large language model (LLM) sparsification. TwIST trains multiple subnetworks in parallel, periodically aggregates their parameters, and resamples new subnetworks during training. This process identifies high-quality subnetworks ("golden tickets") without requiring post-training procedures such as calibration or Hessian-based recovery. As a result, TwIST enables zero-cost pruning at deployment time while achieving perplexity competitive with state-of-the-art post-training sparsification methods. The benefits are most pronounced under aggressive sparsity (e.g., 50%+), where TwIST significantly outperforms baseline methods, for example reaching a perplexity (PPL) of 23.14 compared to 31.64 for the closest prior approach. Unlike unstructured pruning, TwIST produces structured, dense matrices that offer practical inference speedups and memory reductions on commodity hardware (e.g., CPUs) that does not support efficient sparse computation. TwIST thus provides an efficient training-time path to deployable sparse LLMs without additional fine-tuning or recovery overhead.
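To make the sample/train/aggregate/resample loop concrete, the following is a minimal toy sketch in plain Python, not the authors' implementation. It assumes a single two-layer MLP whose subnets are random subsets of hidden units, a placeholder `local_update` in place of real gradient steps, and aggregation by averaging each hidden unit over the workers that held it. The abstract states only that parameters are aggregated, so the averaging rule and all function names and default values here are hypothetical.

```python
import random

def sample_subnet(hidden, keep_ratio, rng):
    """Pick a random subset of hidden-unit indices (one structured subnet)."""
    k = max(1, int(round(keep_ratio * hidden)))
    return sorted(rng.sample(range(hidden), k))

def extract_dense(w_in, w_out, units):
    """Slice out a smaller dense model: rows of w_in, columns of w_out."""
    sub_in = [list(w_in[u]) for u in units]
    sub_out = [[row[u] for u in units] for row in w_out]
    return sub_in, sub_out

def local_update(sub_in, sub_out, lr=0.1):
    """Placeholder for local training (assumption): a fake shrinking step."""
    sub_in = [[w - lr * w for w in row] for row in sub_in]
    sub_out = [[w - lr * w for w in row] for row in sub_out]
    return sub_in, sub_out

def aggregate(w_in, w_out, results):
    """Average each hidden unit over the workers that trained it (assumed rule)."""
    for u in range(len(w_in)):
        owners = [(units.index(u), s_in, s_out)
                  for units, s_in, s_out in results if u in units]
        if not owners:
            continue  # unit not sampled this round: keep the central value
        n = len(owners)
        w_in[u] = [sum(s_in[i][c] for i, s_in, _ in owners) / n
                   for c in range(len(w_in[u]))]
        for r in range(len(w_out)):
            w_out[r][u] = sum(s_out[r][i] for i, _, s_out in owners) / n

def twist_like_training(d_in=4, hidden=8, d_out=3, workers=3,
                        keep_ratio=0.5, rounds=5, seed=0):
    rng = random.Random(seed)
    # Central model: w_in is hidden x d_in, w_out is d_out x hidden.
    w_in = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(hidden)]
    w_out = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(d_out)]
    for _ in range(rounds):
        results = []
        for _ in range(workers):  # would run in parallel on separate workers
            units = sample_subnet(hidden, keep_ratio, rng)  # resampled each round
            s_in, s_out = extract_dense(w_in, w_out, units)
            s_in, s_out = local_update(s_in, s_out)
            results.append((units, s_in, s_out))
        aggregate(w_in, w_out, results)
    # Deployment: sample a subnet and slice it out, with no recovery step.
    deploy_units = sample_subnet(hidden, keep_ratio, rng)
    return extract_dense(w_in, w_out, deploy_units)

sub_in, sub_out = twist_like_training()
print(len(sub_in), len(sub_in[0]), len(sub_out), len(sub_out[0]))
```

Because a subnet is defined by whole hidden units rather than individual weights, the deployed model is simply a pair of smaller dense matrices, which is what lets it run on hardware without sparse-kernel support.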
