Stochastic activations
Maria Lomeli, Matthijs Douze, Gergely Szilvasy, Loic Cabannes, Jade Copet, Sainbayar Sukhbaatar, Jason Weston, Gabriel Synnaeve, Pierre-Emmanuel Mazaré, Hervé Jégou
TL;DR
The paper addresses the tension between the training-time optimization benefits of dense activations (SILU) and the inference-time efficiency of sparse activations (RELU) by introducing Swi+FT and StochA. Swi+FT pretrains with SILU and then fine-tunes with RELU to recover sparsity at inference without sacrificing accuracy, while StochA randomly chooses between the two activations to balance optimization and sparsity and to boost generation diversity. Empirical results on LM1.5B and LM3B show notable CPU speedups (up to ~65% with 90% sparsity) and strong performance on downstream tasks; using StochA at test time yields diverse outputs without extensive fine-tuning. Overall, the work provides practical activation-design strategies that improve inference efficiency and text-generation diversity, and the concepts generalize to arbitrary activation pairs.
Abstract
We introduce stochastic activations. This novel strategy randomly selects between several non-linear functions in the feed-forward layer of a large language model. In particular, we choose between SILU and RELU depending on a Bernoulli draw. This strategy circumvents the optimization problem associated with RELU, namely its constant output for negative inputs, which blocks gradient flow. We leverage this strategy in two ways: (1) We use stochastic activations during pre-training and fine-tune the model with RELU, which is used at inference time to provide sparse latent vectors. This reduces the inference FLOPs and translates into a significant speedup on CPU and GPU. It also leads to better results than training from scratch with the RELU activation function. (2) We evaluate stochastic activations for sequence generation. This strategy performs reasonably well: it yields higher diversity and is only slightly inferior to the best deterministic non-linearity, SILU, combined with temperature sampling. This provides an alternative way to increase the diversity of generated text.
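The Bernoulli selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the parameter `p_relu`, and the choice of one draw per call (rather than per element or per token) are assumptions made for illustration, since the abstract does not specify the granularity of the draw. The `sparsity` helper simply counts exact zeros, which is the property that lets RELU outputs save FLOPs at inference.

```python
import math
import random


def relu(x: float) -> float:
    # Exactly zero for negative inputs: sparse output, but zero gradient there.
    return max(0.0, x)


def silu(x: float) -> float:
    # SiLU(x) = x * sigmoid(x), written in a numerically stable form.
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def stochastic_activation(xs, p_relu, rng):
    """Apply RELU with probability p_relu, otherwise SILU.

    Assumption: a single Bernoulli draw covers the whole vector `xs`.
    """
    use_relu = rng.random() < p_relu  # Bernoulli(p_relu) draw
    act = relu if use_relu else silu
    return [act(x) for x in xs]


def sparsity(xs):
    # Fraction of entries that are exactly zero.
    return sum(1 for x in xs if x == 0.0) / len(xs)


if __name__ == "__main__":
    rng = random.Random(0)
    hidden = [-2.0, -0.5, 0.3, 1.7]
    out = stochastic_activation(hidden, p_relu=0.5, rng=rng)
    print(out, sparsity(out))
```

Setting `p_relu=1.0` reduces the sketch to plain RELU, which matches the deterministic, sparse setting the abstract describes for inference; setting `p_relu=0.0` reduces it to plain SILU.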
