CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure

Boao Kong; Junzhu Liang; Yuxi Liu; Renjia Deng; Kun Yuan

TL;DR

The paper tackles the high memory and compute cost of pre-training large transformers, starting from the observation that the differences between activations of adjacent layers are strongly low-rank. It proposes CR-Net, which reconstructs each layer's activation as the previous layer's activation plus a low-rank difference. Each projection uses two small matrices and a learnable cross-layer scaling factor, which preserves high-rank information with far fewer parameters. The authors also develop an activation-efficient recomputation strategy that reduces the activations stored for backpropagation. Experiments at model scales from 60M to 7B parameters show that CR-Net achieves lower validation perplexity than existing low-rank approaches while reducing parameter count and computation, and that the recomputation variant further improves memory efficiency. Overall, CR-Net offers a principled, scalable framework for memory- and compute-efficient LLM pre-training, and a path toward applying cross-layer low-rank ideas to other transformer variants.
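To make the reconstruction step concrete, here is a minimal NumPy sketch of one cross-layer low-rank projection. It assumes the form `y = beta * y_prev + (x @ A) @ B`, which is one plausible reading of the summary above and not necessarily the paper's exact equation. The function name `cross_layer_low_rank`, the shapes, the rank, and the fixed value of `beta` are all illustrative assumptions.

```python
import numpy as np

def cross_layer_low_rank(x, y_prev, A, B, beta):
    """Assumed form: previous-layer activation, scaled, plus a rank-r correction."""
    return beta * y_prev + (x @ A) @ B

rng = np.random.default_rng(0)
n, d, r = 8, 64, 4  # tokens, hidden size, rank (illustrative values)

x = rng.standard_normal((n, d))

# Assumption: the first layer keeps a dense projection, so later layers
# have a high-rank activation to build on.
W0 = rng.standard_normal((d, d)) / np.sqrt(d)
y_prev = x @ W0

# The two small matrices of the low-rank path.
A = rng.standard_normal((d, r)) / np.sqrt(d)
B = rng.standard_normal((r, d)) / np.sqrt(r)
beta = 0.9  # would be a learnable scalar in training; fixed here

y = cross_layer_low_rank(x, y_prev, A, B, beta)

# Parameter count of this projection versus a dense d x d weight.
low_rank_params = A.size + B.size + 1  # 2 * d * r + 1
dense_params = d * d
print(y.shape, low_rank_params, dense_params)
```

In this sketch the output can be high-rank because it inherits `y_prev`, while the layer itself only stores `2 * d * r + 1` values instead of `d * d`.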

Abstract

Low-rank architectures have become increasingly important for efficient large language model (LLM) pre-training, providing substantial reductions in both parameter complexity and memory/computational demands. Despite these advantages, current low-rank methods face three critical shortcomings: (1) compromised model performance, (2) considerable computational overhead, and (3) limited activation memory savings. To address these limitations, we propose Cross-layer Low-Rank residual Network (CR-Net), an innovative parameter-efficient framework inspired by our discovery that inter-layer activation residuals possess low-rank properties. CR-Net implements this insight through a dual-path architecture that efficiently reconstructs layer activations by combining previous-layer outputs with their low-rank differences, thereby maintaining high-rank information with minimal parameters. We further develop a specialized activation recomputation strategy tailored for CR-Net that dramatically reduces memory requirements. Extensive pre-training experiments across model scales from 60M to 7B parameters demonstrate that CR-Net consistently outperforms state-of-the-art low-rank frameworks while requiring fewer computational resources and less memory.
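One simple way to probe the claimed low-rank property of inter-layer activation residuals is to measure how much of the residual's squared singular-value mass falls in the top `r` singular values. The sketch below does this on synthetic data only; the helper `top_r_energy` and the data-generating setup are hypothetical and are not taken from the paper, whose own measurement procedure may differ.

```python
import numpy as np

def top_r_energy(m, r):
    """Fraction of squared singular-value mass captured by the top r singular values."""
    s = np.linalg.svd(m, compute_uv=False)  # singular values, descending
    return float(np.sum(s[:r] ** 2) / np.sum(s ** 2))

rng = np.random.default_rng(0)
n, d, r = 32, 64, 4  # tokens, hidden size, rank (illustrative values)

# Synthetic stand-ins for two adjacent layers' activations:
# the next layer equals the previous one plus a rank-r update and small noise.
y_prev = rng.standard_normal((n, d))
update = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
y_next = y_prev + update + 0.01 * rng.standard_normal((n, d))

ratio_diff = top_r_energy(y_next - y_prev, r)  # residual between layers
ratio_raw = top_r_energy(y_next, r)            # the activation itself
print(ratio_diff, ratio_raw)
```

On this synthetic example the residual concentrates almost all of its mass in the top `r` singular values while the activation itself does not, which is the kind of gap the abstract's observation refers to. Running the same helper on real activations would require hooking into an actual model.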
