CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure

Boao Kong, Junzhu Liang, Yuxi Liu, Renjia Deng, Kun Yuan

TL;DR

The paper tackles the high memory and compute burden of pretraining large transformers by revealing that inter-layer activation differences possess a strong low-rank structure. It proposes CR-Net, which reconstructs each layer's activation from the previous layer plus a low-rank difference, using two small matrices per projection and a learnable cross-layer scaling to preserve high-rank information with far fewer parameters. An activation-efficient recomputation strategy is developed to minimize activation storage during backpropagation, enabling scalable training with reduced memory and compute. Empirical results across model scales from tens of millions to billions of parameters show CR-Net outperforms existing low-rank approaches in validation perplexity while cutting parameter counts and computation, and its re-computation variant further improves memory efficiency. Overall, CR-Net provides a principled, scalable framework for memory- and compute-efficient LLM pretraining and offers a path to generalizing cross-layer low-rank ideas to other transformer variants.
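
The reconstruction described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the class name `CrossLayerLowRankProjection`, the factor names `A` and `B`, the scalar `beta`, and the initialization scheme are all assumptions introduced here, and the sketch covers only a forward pass for one projection.

```python
import numpy as np


class CrossLayerLowRankProjection:
    """Hypothetical sketch: out = beta * prev_activation + (x @ A) @ B.

    A (d_in x rank) and B (rank x d_out) stand in for the two small matrices,
    and beta stands in for the learnable cross-layer scaling. All names and
    the initialization are assumptions made for illustration.
    """

    def __init__(self, d_in, d_out, rank, beta_init=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((d_in, rank)) / np.sqrt(d_in)
        self.B = rng.standard_normal((rank, d_out)) / np.sqrt(rank)
        self.beta = beta_init

    def __call__(self, x, prev_activation):
        # Low-rank path: models only the difference to the previous layer.
        low_rank_diff = (x @ self.A) @ self.B
        # Cross-layer path: reuses the (possibly high-rank) previous activation.
        return self.beta * prev_activation + low_rank_diff

    def num_params(self):
        # Two factors plus one scalar scaling coefficient.
        return self.A.size + self.B.size + 1


d_model, rank, n_tokens = 64, 8, 16
layer = CrossLayerLowRankProjection(d_model, d_model, rank)

rng = np.random.default_rng(1)
x = rng.standard_normal((n_tokens, d_model))                # input to this layer
prev_activation = rng.standard_normal((n_tokens, d_model))  # same projection, previous layer

out = layer(x, prev_activation)
dense_params = d_model * d_model
print(out.shape, layer.num_params(), dense_params)
```

With these toy sizes the projection holds 64·8 + 8·64 + 1 = 1025 parameters, against 4096 for a dense 64×64 weight, while the output can still be high-rank because it inherits the previous layer's activation.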

Abstract

Low-rank architectures have become increasingly important for efficient large language model (LLM) pre-training, providing substantial reductions in both parameter complexity and memory/computational demands. Despite these advantages, current low-rank methods face three critical shortcomings: (1) compromised model performance, (2) considerable computational overhead, and (3) limited activation memory savings. To address these limitations, we propose Cross-layer Low-Rank residual Network (CR-Net), an innovative parameter-efficient framework inspired by our discovery that inter-layer activation residuals possess low-rank properties. CR-Net implements this insight through a dual-path architecture that efficiently reconstructs layer activations by combining previous-layer outputs with their low-rank differences, thereby maintaining high-rank information with minimal parameters. We further develop a specialized activation recomputation strategy tailored for CR-Net that dramatically reduces memory requirements. Extensive pre-training experiments across model scales from 60M to 7B parameters demonstrate that CR-Net consistently outperforms state-of-the-art low-rank frameworks while requiring fewer computational resources and less memory.

Paper Structure

This paper contains 40 sections, 1 theorem, 55 equations, 9 figures, 22 tables, 1 algorithm.

Key Result

Theorem 1

Suppose the Layerwise Cosine Correlation Bound assumption holds. Then there exists $r_0>0$ such that the approximation $\widetilde{Y}_{l,\beta}^{\text{P}}$ obtained by the matrix recovery equation has a lower error than the direct low-rank approximation $\text{LR}_r({Y}_l^{\text{P}})$ by a
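
The comparison in Theorem 1 can be explored numerically on synthetic data. The sketch below is hedged: the matrix recovery equation is not reproduced in this overview, so it assumes the form $\beta Y_{l-1} + \text{LR}_r(Y_l - \beta Y_{l-1})$, with $\text{LR}_r$ taken to be the truncated SVD. The function names and the toy data (a full-rank previous activation plus an exactly rank-$r$ difference) are illustrative assumptions and do not reproduce the paper's experiments.

```python
import numpy as np


def low_rank_approx(M, r):
    """Best rank-r approximation of M in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]


def cross_layer_recovery(Y_prev, Y_cur, beta, r):
    """Assumed recovery form: beta * Y_prev + LR_r(Y_cur - beta * Y_prev)."""
    return beta * Y_prev + low_rank_approx(Y_cur - beta * Y_prev, r)


def relative_error(Y_hat, Y):
    """Frobenius-norm error relative to the norm of the target."""
    return np.linalg.norm(Y_hat - Y) / np.linalg.norm(Y)


rng = np.random.default_rng(0)
n, d, r = 128, 64, 8

# Toy setup: the previous activation is full rank, the inter-layer
# difference has rank exactly r.
Y_prev = rng.standard_normal((n, d))
diff = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
Y_cur = Y_prev + diff

err_direct = relative_error(low_rank_approx(Y_cur, r), Y_cur)
err_cross = relative_error(cross_layer_recovery(Y_prev, Y_cur, 1.0, r), Y_cur)
print(f"direct low-rank: {err_direct:.3e}, cross-layer: {err_cross:.3e}")
```

In this idealized setting the cross-layer recovery is exact up to floating-point error, whereas the direct rank-$r$ approximation must discard most of the full-rank component of the activation. Real activations only approximately satisfy the low-rank-difference structure, so the gap there would be smaller.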

Figures (9)

  • Figure 1: Illustration of CR-Net based on the LLaMA-2 architecture with $L$ transformer layers. Layer normalization and RoPE are omitted for simplicity.
  • Figure 2: The average relative error of activation recovery over all transformer layers, using direct low-rank approximation versus the matrix recovery equation. (Left: LLaMA-3 8B. Right: GPT-2 small.)
  • Figure 3: The average per-device throughput (tokens/s) of different algorithms. (Left: LLaMA-2 1B pre-training on an NVIDIA A100 40G GPU, with results of other comparable methods from liu2025cola. Middle: LLaMA-2 1B inference on an NVIDIA A100 80G GPU. Right: LLaMA-2 7B pre-training on an NVIDIA A100 80G GPU.)
  • Figure 4: The evaluation perplexity for CR-Net in training LLaMA-2 350M model with different strategies of rank selection.
  • Figure 5: Comparison of evaluation perplexity for CR-Net in training LLaMA-2 350M with fixed $\beta_l^{\text{P}}$ and learnable $\beta_l^{\text{P}}$.
  • ...and 4 more figures

Theorems & Definitions (4)

  • Definition 1: Stable rank
  • Remark 1
  • Theorem 1
  • Proof