Stochastic activations

Maria Lomeli, Matthijs Douze, Gergely Szilvasy, Loic Cabannes, Jade Copet, Sainbayar Sukhbaatar, Jason Weston, Gabriel Synnaeve, Pierre-Emmanuel Mazaré, Hervé Jégou

TL;DR

The paper addresses the tension between the training-time optimization benefits of non-sparse activations (SILU) and the inference-time efficiency of sparse activations (RELU) by introducing two strategies, Swi+FT and StochA. Swi+FT pretrains with SILU and switches to RELU for a final fraction of the training steps, recovering sparsity at inference without sacrificing accuracy. StochA randomly chooses between the two activations to balance optimization and sparsity, and can also be used to increase generation diversity. Experiments on LM1.5B and LM3B show CPU speedups of up to roughly 65% (a factor of 1.65) at 90% activation sparsity together with strong downstream-task performance, and StochA used at test time yields diverse outputs without extensive fine-tuning. Overall, the work offers practical activation-design strategies that improve inference efficiency and text-generation diversity, and the ideas extend to other pairs of activation functions.

Abstract

We introduce stochastic activations. This novel strategy randomly selects between several non-linear functions in the feed-forward layer of a large language model. In particular, we choose between SILU and RELU depending on a Bernoulli draw. This strategy circumvents the optimization problem associated with RELU, namely, its constant shape for negative inputs, which prevents gradient flow. We leverage this strategy in two ways: (1) We use stochastic activations during pre-training and fine-tune the model with RELU, which is used at inference time to provide sparse latent vectors. This reduces the inference FLOPs and translates into a significant speedup on CPU and GPU. It also leads to better results than training from scratch with the RELU activation function. (2) We evaluate stochastic activations for sequence generation. This strategy performs reasonably well: it has higher diversity and only slightly inferior performance compared to the best deterministic non-linearity, SILU, combined with temperature sampling. This provides an alternative way to increase the diversity of generated text.
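The Bernoulli choice described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of one draw per call, and the decision to apply the chosen function to the whole input are assumptions made for illustration. Here `p` is taken to be the probability of selecting SILU, so RELU is selected with probability `1 - p`.

```python
import math
import random


def silu(x: float) -> float:
    """SiLU, x * sigmoid(x), written in a numerically stable form."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def relu(x: float) -> float:
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def stochastic_activation(xs, p, rng, training=True):
    """Apply SILU with probability p, otherwise RELU (hypothetical helper).

    Assumption: a single Bernoulli draw per call selects the activation
    for every element of xs. With training=False the sketch always uses
    RELU, mirroring the sparse-inference use case.
    """
    if not training:
        return [relu(x) for x in xs]
    use_silu = rng.random() < p  # Bernoulli(p) draw
    act = silu if use_silu else relu
    return [act(x) for x in xs]


rng = random.Random(0)
hidden = [-2.0, -0.5, 0.0, 1.5]
print(stochastic_activation(hidden, p=0.5, rng=rng))
print(stochastic_activation(hidden, p=0.5, rng=rng, training=False))
```

With `training=False` every negative pre-activation maps to exactly zero, which is the sparsity that inference can then exploit.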

Paper Structure

This paper contains 52 sections, 5 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Stochastic activation randomly selects one of two activations when $x<0$: (1) RELU selected with probability $1-p$; otherwise (2) another activation, in particular SILU.
  • Figure 2: Swi+FT: Training loss. Most of the training is carried out with SILU, with the final $\alpha \times 100\% = 5\%$, $10\%$ and $20\%$ of the steps using RELU. Note the loss spike when we switch the activation. The model rapidly recovers and converges to a regime where RELU is performing well while providing sparsity.
  • Figure 3: Inference times for 1 token, as a function of the activation sparsity, with a LM3B model trained with Swi+FT. For the CPU we indicate the total inference time, including "other operations": the attention layers (these are not dominant because the generation is limited to 200 tokens), the normalization and the execution overheads. At 90% sparsity the speedup is $\times$1.65. For the GPU, we measure only the FFN inference time. Note that the impact on inference is less important because at this speed and model scale the overheads are dominant. We also indicate the minimum runtime that could be obtained given the GPU's memory bandwidth (roofline model).
  • Figure 4: Training loss with Swi+FT and StochA: [S|R]-S+ activation with $p=0.5$ for $\alpha \times 100\% = 5\%$, $10\%$ and $20\%$, relative to RELU and SILU. This shows that the Swi+FT strategy needs to be combined with StochA to provide good models operating with RELU, compared to finetuning SILU with RELU alone (Figure 2). This plot is zoomed in relative to Figure 2.
  • Figure 5: Swi+FT: analysis of the fine-tuning rate $\alpha$. Average performance over the benchmarks as a function of the percentage $\alpha$ of steps for which we switch to the RELU activation at the end of training. We use RELU at inference time.
  • ...and 2 more figures
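
The Swi+FT schedule referred to in the captions of Figures 2 and 5 (SILU for most of training, RELU for the final fraction $\alpha$ of steps) can be sketched as a simple step-indexed switch. The function name, the 0-indexed step convention and the rounding rule are assumptions for illustration, not details taken from the paper.

```python
def swi_ft_activation(step: int, total_steps: int, alpha: float) -> str:
    """Return which activation to use at a given training step.

    Hypothetical helper: steps are 0-indexed, and the last
    round(alpha * total_steps) steps use RELU; all earlier steps use SILU.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    relu_steps = round(alpha * total_steps)
    switch_step = total_steps - relu_steps
    return "relu" if step >= switch_step else "silu"


total = 1000
for alpha in (0.05, 0.10, 0.20):
    n_relu = sum(swi_ft_activation(s, total, alpha) == "relu" for s in range(total))
    print(f"alpha={alpha:.2f}: RELU on the last {n_relu} of {total} steps")
```

In this sketch the switch is abrupt, which matches the loss spike the Figure 2 caption describes at the moment the activation changes.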