Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers

Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre

TL;DR

Wide and structured networks can use training FLOPs more efficiently, reaching lower loss with fewer parameters than dense models trained at their optimal trade-off.

Abstract

State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive. This has sparked a research agenda to reduce these models' parameter counts and computational costs without significantly impacting their performance. Our study focuses on transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFNs), which are less studied than attention blocks. We consider three structured linear parameterizations of the FFN using efficient low-rank and block-diagonal matrices. In contrast to many previous works that examined these approximations, our study i) explores these structures from a training-from-scratch perspective, ii) scales up to 1.3B parameters, and iii) is conducted within recent Transformer-based LLMs rather than convolutional architectures. We demonstrate that these structures can lead to actual computational gains in various scenarios, including online decoding when using a pre-merge technique. Additionally, we propose a novel training regime, called self-guided training, aimed at improving the poor training dynamics that these approximations exhibit when used from initialization. We also explore the scaling behavior of structured matrices, which reveals steeper scaling curves in training FLOPs along with a favorable trend in the overtraining regime. Specifically, we show that wide and structured networks can utilize training FLOPs more efficiently, reaching lower loss with fewer parameters than dense models at their optimal trade-off. Our code is available at https://github.com/CLAIRE-Labo/StructuredFFN/tree/main.
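
For intuition about where the savings come from (an illustrative calculation using the notation of Figure 2, not a result reported in the paper): a dense layer with input dimension $N$ and output dimension $M$ has $NM$ parameters and costs about $2NM$ FLOPs per token, whereas a rank-$R$ factorization has $R(N+M)$ parameters and costs about $2R(N+M)$ FLOPs, which is cheaper whenever $R < NM/(N+M)$. For example, with $N = M = 768$ and $R = 128$, the factorized layer has roughly 0.20M parameters versus 0.59M for the dense one, i.e. about a third.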

Paper Structure

This paper contains 38 sections, 5 equations, 14 figures, and 12 tables.

Figures (14)

  • Figure 1: Steeper scaling curves of LowRank with 63% or 32% of the FFN parameters. For more results, see the scaling subsection of the paper.
  • Figure 2: Structured linear parametrization: We show the structured linear parametrization with input dimension $N$ and output dimension $M$. a) The traditional dense linear parametrization. b) LowRank parametrization with a bottleneck of size $R$, where $R$ is smaller than both $M$ and $N$. c) BlockShuffle with two block-diagonal matrices with blocks of size $B$, interleaved with a shuffle operation that mixes information across blocks, similar to ShuffleNet. d) BlockDense, where the first matrix is block-diagonal and the second is a low-rank or dense matrix. (A minimal code sketch of these parametrizations follows the figure list.)
  • Figure 3: Poor training dynamics: Training dynamics of LowRank with a rank of 128 under different training configurations. Curves correspond to a 4-layer Transformer with a model width of 768 on WikiText-103. We apply self-guided training in the first half of training. Refer to the training-dynamics subsection of the paper for visualizations of the other two structured parameterizations.
  • Figure 4: Scaling curves of structured matrices with a linear fit for illustration. The dense model is trained at its optimal trade-off, while the structured FFNs are trained on the same number of tokens and retain 63% or 32% of the original parameters. 1) Structured matrices have steeper scaling curves, with results much closer to dense at larger sizes, showing good scaling behavior. 2) At the same training FLOPs, these curves indicate that structured matrices can reach lower validation loss with fewer parameters as the x-axis is extended further.
  • Figure 5: Zero-shot performance on downstream tasks in the overtraining regime. The wide and structured networks are built upon dense ones by applying LowRank to the FFN and reducing the number of attention heads to make the entire network structured.
  • ...and 9 more figures
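
To make the parametrizations in Figure 2 concrete, below is a minimal sketch, assuming PyTorch. The class names, initialization, and shapes are illustrative assumptions rather than the paper's implementation; see the linked repository for the authors' code.

    # Illustrative sketch of the three structured FFN parametrizations
    # (names and details are hypothetical, not taken from the paper's repository).
    import torch
    import torch.nn as nn

    class LowRankLinear(nn.Module):
        """y = (x W1) W2 with a bottleneck of size R < min(N, M)."""
        def __init__(self, n_in, n_out, rank):
            super().__init__()
            self.w1 = nn.Linear(n_in, rank, bias=False)
            self.w2 = nn.Linear(rank, n_out, bias=False)

        def forward(self, x):
            return self.w2(self.w1(x))

    class BlockDiagonalLinear(nn.Module):
        """One block-diagonal matrix: each block maps its own slice of the input."""
        def __init__(self, n_in, n_out, num_blocks):
            super().__init__()
            assert n_in % num_blocks == 0 and n_out % num_blocks == 0
            self.num_blocks = num_blocks
            self.weight = nn.Parameter(
                torch.randn(num_blocks, n_in // num_blocks, n_out // num_blocks)
                / (n_in // num_blocks) ** 0.5
            )

        def forward(self, x):
            *lead, d = x.shape
            xb = x.reshape(*lead, self.num_blocks, d // self.num_blocks)
            # one batched matmul over all blocks
            yb = torch.einsum("...bi,bio->...bo", xb, self.weight)
            return yb.reshape(*lead, -1)

    class BlockShuffle(nn.Module):
        """Two block-diagonal matrices interleaved with a channel shuffle that
        mixes information across blocks (in the spirit of ShuffleNet)."""
        def __init__(self, n_in, n_out, num_blocks):
            super().__init__()
            self.num_blocks = num_blocks
            self.first = BlockDiagonalLinear(n_in, n_in, num_blocks)
            self.second = BlockDiagonalLinear(n_in, n_out, num_blocks)

        def forward(self, x):
            h = self.first(x)
            *lead, d = h.shape
            h = h.reshape(*lead, self.num_blocks, d // self.num_blocks)
            h = h.transpose(-1, -2).reshape(*lead, d)  # shuffle across blocks
            return self.second(h)

    class BlockDense(nn.Module):
        """A block-diagonal matrix followed by a dense (or low-rank) matrix."""
        def __init__(self, n_in, n_out, num_blocks):
            super().__init__()
            self.first = BlockDiagonalLinear(n_in, n_in, num_blocks)
            self.second = nn.Linear(n_in, n_out, bias=False)

        def forward(self, x):
            return self.second(self.first(x))

The block-diagonal product is written as a single batched einsum so that all blocks are computed in one kernel call, which is how such structures typically realize their FLOP savings in practice.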