LORA-CRAFT: Cross-layer Rank Adaptation via Frozen Tucker Decomposition of Pre-trained Attention Weights

Kasun Dewage, Marianna Pensky, Suranadi De Silva, Shankadeep Mondal

TL;DR

CRAFT is a parameter-efficient fine-tuning approach that operates on pre-trained attention weights by stacking them across layers and applying a full Tucker-3 decomposition. All Tucker factors are frozen, and training occurs through small residual adapters $J^{(n)}$ that act on each mode, yielding a trainable budget of $n_p(r_1^2 + r_2^2 + r_3^2)$ that is independent of model dimension $d$ and depth $N_L$ at fixed ranks. A residual formulation preserves the original weights at initialization, which enables stable fine-tuning. On GLUE, the method is competitive on RoBERTa-base and RoBERTa-large with roughly 41K trainable parameters, significantly fewer than competing PEFT methods. Empirically, CRAFT achieves strong efficiency-accuracy trade-offs, particularly for large models, while offering storage benefits through compact factor representations. Limitations include an evaluation scope restricted to RoBERTa on GLUE and the need to explore rank scaling for substantially larger architectures and for generation tasks.

Abstract

We introduce CRAFT (Cross-layer Rank Adaptation via Frozen Tucker), a parameter-efficient fine-tuning (PEFT) method that applies Tucker tensor decomposition to pre-trained attention weight matrices stacked across transformer layers and trains only small square adaptation matrices on the resulting frozen Tucker factors. Existing tensor-based PEFT methods decompose gradient updates: LoTR applies Tucker decomposition with shared factor matrices, while SuperLoRA groups and reshapes $\Delta W$ across layers before applying Tucker decomposition. Separately, methods like PiSSA apply SVD to pre-trained weights but operate independently per layer. CRAFT bridges these two lines of work: it performs full Tucker decomposition via Higher-Order SVD (HOSVD) directly on pre-trained weights organized as cross-layer 3D tensors, freezes all resulting factors, and adapts the model through lightweight trainable transformations applied to each factor matrix. Experiments on the GLUE benchmark using RoBERTa-base and RoBERTa-large demonstrate that CRAFT achieves competitive performance with existing methods while requiring only 41K Tucker adaptation parameters, a count independent of model dimension and depth at fixed Tucker ranks.
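
The pipeline described in the abstract (stack per-layer weights into a 3D tensor, run HOSVD, freeze the factors, train small square matrices on each mode) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tensor layout `(n_layers, d, d)`, the toy sizes and ranks, the helper names, and the specific residual form `W + (Tucker(U J) - Tucker(U))` are all assumptions chosen so that the original weights are recovered when every adapter equals the identity.

```python
import numpy as np

def unfold(t, mode):
    """Mode-n unfolding: bring `mode` to the front and flatten the other axes."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def mode_product(t, m, mode):
    """Mode-n product of tensor t with matrix m of shape (new_dim, t.shape[mode])."""
    out = np.tensordot(m, t, axes=(1, mode))  # contracted axis is replaced and lands first
    return np.moveaxis(out, 0, mode)

def hosvd(t, ranks):
    """Truncated HOSVD: leading left singular vectors of each unfolding, then the core."""
    factors = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(t, mode), full_matrices=False)
        factors.append(u[:, :r])
    core = t
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)
    return core, factors

def reconstruct(core, factors, adapters=None):
    """Rebuild the tensor; optionally right-multiply each frozen factor by an adapter."""
    out = core
    for mode, u in enumerate(factors):
        m = u if adapters is None else u @ adapters[mode]
        out = mode_product(out, m, mode)
    return out

# Toy stand-in for one projection type (e.g. Q) stacked over layers; sizes are illustrative.
rng = np.random.default_rng(0)
n_layers, d = 4, 16
weights = rng.standard_normal((n_layers, d, d))

ranks = (3, 8, 8)                        # hypothetical Tucker ranks
core, factors = hosvd(weights, ranks)    # frozen after this point

# Trainable square adapters, one per mode; identity here (assumed initialization).
adapters = [np.eye(r) for r in ranks]

# Assumed residual form: only the change induced by the adapters is added to W.
delta = reconstruct(core, factors, adapters) - reconstruct(core, factors)
adapted = weights + delta

n_trainable = sum(j.size for j in adapters)  # r1^2 + r2^2 + r3^2 for one projection type
```

With identity adapters `delta` vanishes, so `adapted` equals the stacked pre-trained weights even though the truncated Tucker approximation itself is lossy. The trainable count depends only on the ranks, not on `d` or `n_layers`.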

Paper Structure

This paper contains 16 sections, 1 theorem, 10 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

CRAFT with Tucker ranks $(r_1, r_2, r_3)$ applied to $n_p$ projection types has exactly $n_p(r_1^2 + r_2^2 + r_3^2)$ Tucker adaptation parameters (excluding the task-specific classifier head), independent of model dimension $d$ and depth $N_L$ for fixed ranks. For a model with $N_L$ layers and dimension $d$ (where $d_{out} \asymp d_{in} \asymp d$), the trainable parameter counts of competing methods are: CRAFT is the only m

Figures (4)

  • Figure 1: PCA of Vectorized Attention Weights Across Models and Layers. Each point represents one row of a weight matrix $W_\alpha^{(\ell)}$, projected onto the first two principal components computed from the pooled set $\{w_\alpha^{(\ell)}\}$ at each layer. Q (pink) exhibits higher dispersion (Eq. \ref{eq:dispersion}), while K (olive) and V (green) concentrate near the origin. The two-component explained-variance ratio (shown in titles) decreases in deeper layers for ViT and GPT-2, indicating that the weight distribution spreads over a higher effective dimensionality.
  • Figure 2: CRAFT Architecture Overview. Pre-trained attention weights (Q, V) are stacked across $N_L$ layers into 3D tensors. HOSVD decomposes each tensor into a core tensor $\mathcal{G}$ and factor matrices $U^{(1)}, U^{(2)}, U^{(3)}$. All decomposition factors are frozen. Adaptation occurs only through small square matrices $J^{(1)} \in \mathbb{R}^{r_1 \times r_1}$, $J^{(2)} \in \mathbb{R}^{r_2 \times r_2}$, $J^{(3)} \in \mathbb{R}^{r_3 \times r_3}$ (shown in red), initialized near identity. The original pre-trained weights $\mathcal{W}$ are preserved exactly at initialization. This yields a trainable parameter count of $2(r_1^2 + r_2^2 + r_3^2)$, independent of model dimension $d$ and depth $N_L$ at fixed Tucker ranks.
  • Figure 3: PEFT Method Taxonomy. CRAFT uniquely combines cross-layer tensor structure with pre-trained weight decomposition.
  • Figure 4: Parameter Number vs. Model Depth. At fixed Tucker ranks, CRAFT's Tucker adaptation parameter count remains constant regardless of model depth, while LoRA and PiSSA scale linearly with $N_L$. Whether the same ranks suffice for significantly deeper models is an open question (see Section \ref{sec:discussion}).
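
The scaling contrast in Figure 4 can be checked with a few lines of arithmetic. The sketch below uses the count $n_p(r_1^2 + r_2^2 + r_3^2)$ stated for CRAFT and the standard LoRA count of $r(d_{in} + d_{out})$ per adapted matrix. The function names are hypothetical, and the CRAFT ranks `(64, 64, 12)` are purely illustrative; they are not the ranks used in the paper and are not meant to reproduce its 41K figure. The layer counts and hidden sizes are those of RoBERTa-base (12 layers, 768) and RoBERTa-large (24 layers, 1024).

```python
def craft_params(ranks, n_proj):
    """Tucker adaptation parameters: one r_k x r_k adapter per mode, per projection type."""
    return n_proj * sum(r * r for r in ranks)

def lora_params(rank, d_in, d_out, n_layers, n_proj):
    """Standard LoRA count: factors A (rank x d_in) and B (d_out x rank) per adapted matrix."""
    return n_proj * n_layers * rank * (d_in + d_out)

n_proj = 2                      # Q and V projections, as in Figure 2
illustrative_ranks = (64, 64, 12)  # assumption for illustration only

# (name, number of layers, hidden dimension)
models = [("RoBERTa-base", 12, 768), ("RoBERTa-large", 24, 1024)]

for name, n_layers, d in models:
    craft = craft_params(illustrative_ranks, n_proj)
    lora = lora_params(8, d, d, n_layers, n_proj)
    print(f"{name}: CRAFT={craft:,}  LoRA(r=8)={lora:,}")
```

The CRAFT figure is identical for both models because neither $d$ nor $N_L$ enters the formula, whereas the LoRA figure grows with both. This only restates the counting argument; as the caption notes, it says nothing about whether fixed ranks remain adequate for deeper models.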

Theorems & Definitions (6)

  • Definition 1: Mode-$n$ Unfolding
  • Definition 2: Mode-$n$ Product
  • Definition 3: Tucker Decomposition
  • Remark 1: Weight Preservation at Initialization
  • Proposition 1: Trainable Parameter Count and Scaling Comparison
  • Remark 2