CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation

Ziyue Liu, Ruijie Zhang, Zhengyang Wang, Mingsong Yan, Zi Yang, Paul Hovland, Bogdan Nicolae, Franck Cappello, Sui Tang, Zheng Zhang

TL;DR

CoLA introduces a compute-efficient architecture for pre-training LLMs by replacing full-size MLPs and projection layers with bottleneck auto-encoders that enforce low-rank activations. It is complemented by CoLA-M, a memory-efficient variant that reduces activation storage by recomputing selected modules during the backward pass. Theoretical results show that nonlinear activations can yield better low-rank representations under data-dependent conditions, and an effective-rank–aware bound clarifies when CoLA is advantageous. Empirically, CoLA achieves about 2× reductions in parameters and FLOPs with full-rank-level performance, while CoLA-M further reduces memory cost without sacrificing throughput; inference also benefits from lower latency and memory cost. Overall, CoLA and CoLA-M offer substantial practical efficiency gains for dense LLM pre-training and deployment, with potential extensions to mixture-of-experts models.

Abstract

The full-size MLPs and the projection layers in attention inflate the model size of large language models (LLMs), consuming extensive computational resources in pre-training. We empirically observe that the activations of pre-trained LLMs exhibit a low-rank property. Motivated by this observation, we propose CoLA and its memory-efficient implementation, CoLA-M, to replace these full-size layers with compute-efficient auto-encoders that naturally enforce low-rank activations throughout training. This fundamental architectural change eliminates the activation redundancy and significantly boosts model capacity and training efficiency. Experiments on LLaMA models with 60 million to 7 billion parameters show that CoLA reduces the computing cost by $\mathbf{2\times}$ and improves training throughput by $\mathbf{1.86\times}$ while maintaining full-rank-level performance. CoLA-M further squeezes memory cost without sacrificing throughput, offering a pre-training approach with collectively superior parameter, computing, and memory efficiency. The LLMs produced are also $\mathbf{2\times}$ smaller, enabling faster inference with lower memory cost on resource-constrained platforms.
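
To make the idea of replacing a full-size layer with a bottleneck auto-encoder concrete, here is a minimal NumPy sketch. It assumes the layer has the form `act(x @ A) @ B` with a nonlinearity at the rank-`r` bottleneck. The names `cola_layer`, `cola_params`, and `full_rank_params`, the SiLU activation, and the choice `r = d / 4` are illustrative assumptions, not details taken from this summary.

```python
import numpy as np

def silu(z):
    # SiLU nonlinearity; chosen only for illustration (assumption).
    return z / (1.0 + np.exp(-z))

def full_rank_params(d_in, d_out):
    # Parameter count of a dense d_in x d_out weight matrix.
    return d_in * d_out

def cola_params(d_in, d_out, r):
    # Parameter count of the two rank-r factors A (d_in x r) and B (r x d_out).
    return r * (d_in + d_out)

def cola_layer(x, A, B, act=silu):
    # Hypothetical bottleneck layer: down-project, apply the nonlinearity, up-project.
    # The output has rank at most r because it is a product with the (r x d_out) factor B.
    return act(x @ A) @ B

rng = np.random.default_rng(0)
d_in, d_out, r = 512, 512, 128          # r = d / 4 is an assumed setting
A = rng.standard_normal((d_in, r)) / np.sqrt(d_in)
B = rng.standard_normal((r, d_out)) / np.sqrt(r)
x = rng.standard_normal((256, d_in))    # 256 token vectors

y = cola_layer(x, A, B)
print(y.shape, np.linalg.matrix_rank(y))
print(full_rank_params(d_in, d_out), cola_params(d_in, d_out, r))
```

For a square layer with `r = d / 4`, the two factors hold `2rd = d^2 / 2` parameters, which is half of the dense layer and is consistent with the roughly 2× reduction quoted above.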

Paper Structure

This paper contains 33 sections, 8 theorems, 42 equations, 12 figures, 12 tables.

Key Result

Proposition 3.1

If $\sigma(0)=0$ and $\sigma'(0)\neq0$, then $\mathcal{E}_{\sigma}(r)\leq \mathcal{E}_{\mathrm{id}}(r)$.

Figures (12)

  • Figure 1: Comparison between various pre-training methods on a LLaMA-1B model with a token batch size of 256. Among them, CoLA is the only one that reduces both compute FLOPs and model size while achieving validation perplexity on par with full-rank training.
  • Figure 2: MLP activation [i.e., Eq. \ref{eq:full-rank-fwd}] spectrum of the pre-trained GPT-2 small (Radford et al., 2019). Model activations are evaluated on the WikiText2 dataset. a) The singular value decay across different decoder blocks. b) The full dimension vs. effective rank ($\alpha=0.95$).
  • Figure 3: Comparison between different pre-training frameworks. a) LoRA/ReLoRA (Lialin et al., 2023) freezes a full-rank weight; b) GaLore (Zhao et al., 2024) only reduces optimizer states by down- and up-projecting gradients; c) SLTrain (Han et al., 2024) requires reconstruction of the low-rank and sparse matrices; d) CoLA (ours) is a pure low-rank architecture involving only rank $r$ weight matrices.
  • Figure 4: A decoder block in CoLA with a LLaMA-like architecture (layer norms and rotary positional embeddings are omitted for simplicity). All MLP layers and projection layers in attention are replaced with auto-encoders. Modules drawn in sketch style are recomputed during the backward step of CoLA-M (a memory-efficient implementation of CoLA).
  • Figure 5: Memory breakdown for LLaMA-1B using fairly large sequence batch sizes in pre-training. Activation memory dominates.
  • ...and 7 more figures
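
The effective rank at $\alpha=0.95$ mentioned in the Figure 2 caption can be illustrated with a short sketch. The paper's exact definition is not reproduced in this summary, so the code below assumes one common convention: the smallest $k$ whose top-$k$ squared singular values account for at least a fraction $\alpha$ of the total. The function name `effective_rank` is hypothetical.

```python
import numpy as np

def effective_rank(X, alpha=0.95):
    # Assumed definition: smallest k such that the top-k squared singular
    # values capture at least `alpha` of the total spectral energy.
    s = np.linalg.svd(X, compute_uv=False)       # singular values, descending
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)  # cumulative energy fraction
    k = int(np.searchsorted(energy, alpha)) + 1  # first index reaching alpha
    return min(k, len(s))                        # guard against round-off at alpha = 1

rng = np.random.default_rng(0)
# Synthetic "activation" matrix of exact rank 8 living in 64 dimensions.
acts = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 64))
print(acts.shape[1], effective_rank(acts, alpha=0.95))
```

On a matrix like this, the effective rank is bounded by the true rank (8) even though the full dimension is 64, which is the kind of gap the figure's panel b) compares.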

Theorems & Definitions (16)

  • Proposition 3.1
  • Proposition 3.2
  • Theorem 3.3
  • Theorem 3.4
  • Proof of Proposition 3.1
  • Proof of Proposition 3.2
  • Proof of Theorem 3.3
  • Proof of Theorem 3.4
  • Lemma H.1
  • proof
  • ...and 6 more