LaX: Boosting Low-Rank Training of Foundation Models via Latent Crossing

Ruijie Zhang, Ziyue Liu, Zhengyang Wang, Zheng Zhang

TL;DR

LaX introduces Latent Crossing, a lightweight plug-in that enables information flow across low-rank subspaces to restore expressiveness without increasing rank. The method uses residual connections and configurable gates to adapt to different low-rank structures (e.g., SVD, CoLA, TT) and remains compatible with LoRA for efficient fine-tuning. Empirical results across ViT and LLaMA-like models show LaX closes much of the gap between low-rank and full-rank baselines, with gains in pretraining accuracy and perplexity and improvements on arithmetic and commonsense reasoning during fine-tuning. The work offers a practical, generalizable approach to training efficient foundation models, reducing compute while preserving or enhancing performance.

Abstract

Training foundation models such as ViTs and LLMs requires tremendous computing cost. Low-rank matrix or tensor factorization offers a parameter-efficient alternative, but often degrades performance due to the restricted parameter space. In this work, we introduce Latent Crossing (LaX), a simple yet effective plug-and-play module that enhances the capacity of low-rank models by enabling information flow across low-rank subspaces. We extensively validate the benefits of LaX on pre-training tasks with ViT-Base/Large and LLaMA-like models ranging from 60M to 1B parameters. LaX boosts low-rank model performance to match or exceed the full-rank baselines while using 2-3\(\times\) fewer parameters. When equipped with low-rank adapters (i.e., LoRA) for fine-tuning LLaMA-7B/13B, LaX consistently improves performance on arithmetic and commonsense reasoning tasks at negligible cost.
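
As a rough illustration of the idea in the abstract, the following NumPy sketch stacks two rank-$r$ layers and feeds the first layer's latent vector into the second layer's latent space through a residual connection. The function names (`low_rank_forward`, `lax_forward`), the identity gate, the optional `gate` matrix, and the sizes are all assumptions chosen for illustration; the paper's actual gate designs and placement may differ.

```python
import numpy as np

def low_rank_params(d_in, d_out, r):
    """Parameter count of a rank-r factorization W ~= B @ A."""
    return r * (d_in + d_out)

def low_rank_forward(x, A, B):
    """Plain low-rank layer: project to the rank-r latent, then back up."""
    z = A @ x                      # latent vector, shape (r,)
    return B @ z, z

def lax_forward(x, A, B, z_prev, gate=None):
    """Low-rank layer whose latent also receives the previous latent.

    `gate` is a hypothetical (r, r) matrix; None means an identity gate.
    """
    z = A @ x
    z = z + (z_prev if gate is None else gate @ z_prev)  # residual crossing
    return B @ z, z

rng = np.random.default_rng(0)
d, r = 64, 16                      # arbitrary illustrative sizes
A1, B1 = rng.standard_normal((r, d)), rng.standard_normal((d, r))
A2, B2 = rng.standard_normal((r, d)), rng.standard_normal((d, r))
x = rng.standard_normal(d)

h1, z1 = low_rank_forward(x, A1, B1)            # layer 1 and its latent
y_plain, _ = low_rank_forward(h1, A2, B2)       # layer 2 without crossing
y_lax, z2 = lax_forward(h1, A2, B2, z_prev=z1)  # layer 2 with crossing

dense_params = d * d                       # full-rank weight matrix
lr_params = low_rank_params(d, d, r)       # two rank-r factors
```

With an identity gate the crossing adds no parameters, and the only difference from the plain low-rank output is the extra term `B2 @ z1`. At these sizes the factorized layer stores 2,048 weights against 4,096 for the dense one.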

Paper Structure

This paper contains 31 sections, 6 equations, 9 figures, 23 tables.

Figures (9)

  • Figure 1: LaX boosts the performance of low-rank training methods. (a) SVD-based pre-training of ViT-B on ImageNet-1K with different matrix ranks: a lower rank leads to a larger performance drop; LaX consistently improves the performance in all settings. (b) Pre-training ViT-B on ImageNet-1K with different low-rank methods. LaX significantly improves performance for all low-rank methods, even surpassing full-rank pre-training. (c) Fine-tuning LLaMA-7B on commonsense reasoning tasks using LoRA, with and without LaX. LaX improves LoRA's fine-tuning performance in all tasks.
  • Figure 2: LaX is a general module that can be plugged into low-rank neural network models. (a) Dense layers: full information flow, effective but computationally expensive. (b) SVD/CoLA layers: rank-$r$ bottlenecks with two factors; LaX can be inserted into the latent space between layers. (c) Tensor-train layers: bottleneck structure with four tensor cores, where data flow is governed by tensor contractions; LaX can be applied either between cores or across layers. (d) LoRA adapters: LaX can be placed between different adapters.
  • Figure 3: A 6-core Tensor Train layer with the symmetric setting. For Tensor Train layers with identical input and output shapes, we can naturally arrange the tensor ranks in a symmetric configuration, where $r_0=r_4$ and $r_1=r_3$ in this example. This reduces the need for shape transformation operations, making Intra-Layer LaX more efficient when applied.
  • Figure 4: LaX Gate.
  • Figure 5: Two-Core Tensor Gate. A residual tensor $\boldsymbol{\mathcal{R}} \in \mathbb{R}^{r_0 \times 1 \times r_1}$ is contracted with two gating tensor cores, $\boldsymbol{\mathcal{C}}^0$ and $\boldsymbol{\mathcal{C}}^1$, producing a transformed residual tensor $\boldsymbol{\mathcal{R}}' \in \mathbb{R}^{r'_0 \times 1 \times r'_1}$.
  • ...and 4 more figures
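
The contraction described in the Figure 5 caption can be sketched with a single `einsum`. The caption gives only the shapes of the residual tensor before and after the gate, so the core shapes used here (`C0` of shape `(r0_new, r0)` and `C1` of shape `(r1, r1_new)`) and the function name `two_core_gate` are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def two_core_gate(R, C0, C1):
    """Contract a residual tensor R of shape (r0, 1, r1) with two cores.

    Assumed core shapes (not stated in the caption):
      C0: (r0_new, r0)  acts on the first rank index of R
      C1: (r1, r1_new)  acts on the last rank index of R
    Returns a tensor of shape (r0_new, 1, r1_new).
    """
    return np.einsum("ai,ijk,kb->ajb", C0, R, C1)

rng = np.random.default_rng(0)
r0, r1 = 4, 6              # ranks of the incoming residual
r0_new, r1_new = 3, 5      # ranks expected at the destination
R = rng.standard_normal((r0, 1, r1))
C0 = rng.standard_normal((r0_new, r0))
C1 = rng.standard_normal((r1, r1_new))

R_new = two_core_gate(R, C0, C1)
```

Because the middle dimension is 1, this is equivalent to the matrix product `C0 @ R[:, 0, :] @ C1`, which makes the role of the gate easy to see: it reshapes the two rank dimensions of the residual so they match the tensor it is added to.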