LoRTA: Low Rank Tensor Adaptation of Large Language Models

Ignacio Hounie; Charilaos Kanatsoulis; Arnuv Tandon; Alejandro Ribeiro

TL;DR

LoRTA tackles the high parameter cost of fine-tuning large language models by introducing a $5$-way CANDECOMP/PARAFAC (CPD) tensor parameterization that unifies updates across layers, attention heads, and Q/K/V/P matrices. By representing all weight updates as a single high-order tensor and learning all factor matrices jointly, LoRTA achieves substantial parameter reductions without compromising performance across tasks including NLU, instruction tuning, preference optimization, and protein folding. Empirical results show LoRTA matching or surpassing state-of-the-art tensor-based PEFT methods with dramatically fewer trainable parameters (often by one or two orders of magnitude) and favorable I/O characteristics for concurrent adapters. The work demonstrates the practical potential of high-order tensor adapters for scalable, multi-task LLM fine-tuning and points to future directions like MoE integration and more efficient tensor operations.

Abstract

Low Rank Adaptation (LoRA) is a popular Parameter Efficient Fine Tuning (PEFT) method that effectively adapts large pre-trained models for downstream tasks. LoRA parameterizes model updates using low-rank matrices at each layer, significantly reducing the number of trainable parameters and, consequently, resource requirements during fine-tuning. However, the lower bound on the number of trainable parameters remains high due to the use of the low-rank matrix model. Recent works have addressed this limitation by proposing low-rank tensor parameterizations for model updates. These approaches, however, either exploit redundancy only across layers or tensorize individual matrices using ad-hoc schemes that introduce additional hyperparameters. In this work, we propose a higher-order Candecomp/Parafac (CP) decomposition, enabling a more compact and flexible representation compared to existing matrix- and tensor-based PEFT methods. Our experiments on Natural Language Understanding, Instruction Tuning, Preference Optimization and Protein Folding benchmarks demonstrate that our method can substantially reduce the number of trainable parameters while maintaining comparable performance.
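To make the parameter-count argument concrete, the following is a minimal sketch that compares the number of trainable parameters of per-matrix LoRA with that of a single 5-way CP model. It rests on assumptions not stated in the abstract: every adapted matrix is square ($d \times d$), four matrices (Q, K, V, P) are adapted per layer, and the five tensor modes have sizes $d$, $d/H$, $H$, $L$ and $4$. The function names are hypothetical and the model dimensions are illustrative only.

```python
def lora_param_count(d, num_layers, rank, num_matrices=4):
    """Trainable parameters when each adapted d x d matrix gets its own LoRA pair.

    Assumption: every adapted matrix is square, so its two factors have
    shapes (d x rank) and (rank x d).
    """
    return num_layers * num_matrices * rank * (d + d)


def cp_param_count(d, num_heads, num_layers, rank, num_matrices=4):
    """Trainable parameters of one rank-`rank` CP model over all updates.

    Assumption: the update tensor has five modes of sizes
    d, d / num_heads, num_heads, num_layers and num_matrices,
    and a CP model stores one (mode_size x rank) factor per mode.
    """
    mode_sizes = (d, d // num_heads, num_heads, num_layers, num_matrices)
    return rank * sum(mode_sizes)


# Illustrative dimensions (hidden size 4096, 32 heads, 32 layers), rank 1.
lora_r1 = lora_param_count(d=4096, num_layers=32, rank=1)
cp_r1 = cp_param_count(d=4096, num_heads=32, num_layers=32, rank=1)
print(f"LoRA rank 1: {lora_r1} parameters")
print(f"CP   rank 1: {cp_r1} parameters")
```

Under these assumptions the LoRA count grows with the product of the number of layers and adapted matrices, whereas the CP count grows only with the sum of the mode sizes, which is why the lower bound at rank 1 is so much smaller.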

Paper Structure (33 sections, 24 equations, 7 figures, 15 tables)

Introduction
Preliminaries
Transformer Architecture
Low Rank (matrix) Adaptation
Tensor Algebra
Low Rank Tensor adaptation
Parameter sharing across layers
LoRTA: A more efficient tensor model
Other Low Rank Tensor models in PEFT
Experiments
Natural Language Understanding
Instruction Tuning
Preference Optimization
Protein Folding
Computational Advantages
...and 18 more sections

Figures (7)

Figure 1: Illustration of a rank 1 adapter for a single weight matrix with multiple heads. (Left) The LoRA update for head $h$ is computed as $d \bm W_h = \bm b_h \circ \bm a$. (Right) The update using a third order low rank tensor model is computed as $d \bm W_h = \bm b \circ \bm c[h] \circ \bm a$. Both models have the same tensor rank, but the latter has fewer parameters.
Figure 2: Mean cross-entropy loss on training and testing data for Llama2-7b on the Alpaca dataset vs number of trainable parameters for different adapter ranks. Lower is better. Numbers on top of markers denote the adapter rank.
Figure 3: Performance on MT-Bench for Llama2-7b models fine-tuned with LoRA and LoRTA. Higher is better. Left: Average score across all questions vs number of trainable parameters. Numbers on top of markers denote the adapter rank. Right: Average score per task.
Figure 4: (Left) Mean DPO loss on held-out data from the Orca DPO pairs dataset vs number of trainable parameters, lower is better. (Right) MT-Bench average scores vs number of trainable parameters, higher is better.
Figure 5: Performance on MT-Bench for Llama2-7b models fine-tuned with LoRA and LoRTA using DPO on the Intel Orca DPO pairs dataset. Average score per task. Higher is better.
...and 2 more figures
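
The rank 1 comparison described in the Figure 1 caption can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the authors' implementation: the vector orientation ($\bm b$ indexes the per-head output dimension, $\bm a$ the input dimension), the toy sizes, and all function names are hypothetical.

```python
def outer(u, v):
    """Outer product of two vectors as a nested list of shape len(u) x len(v)."""
    return [[ui * vj for vj in v] for ui in u]


def lora_head_update(b_heads, a, h):
    """LoRA rank 1 update for head h: dW_h = b_h o a (a separate b_h per head)."""
    return outer(b_heads[h], a)


def tensor_head_update(b, c, a, h):
    """Third order rank 1 update for head h: dW_h = b o c[h] o a.

    b and a are shared by all heads; only the scalar c[h] is head specific.
    """
    return [[c[h] * x for x in row] for row in outer(b, a)]


def count_lora(b_heads, a):
    """Parameters of the LoRA model: one b_h per head plus the shared a."""
    return sum(len(b_h) for b_h in b_heads) + len(a)


def count_tensor(b, c, a):
    """Parameters of the tensor model: shared b, one scalar per head, shared a."""
    return len(b) + len(c) + len(a)


# Toy sizes (assumed): 4 heads, per-head dimension 2, input dimension 8.
a = [1, 2, 3, 4, 5, 6, 7, 8]
b_heads = [[1, 2], [3, 4], [5, 6], [7, 8]]  # LoRA: one vector per head
b = [1, 2]                                  # tensor model: shared across heads
c = [3, 5, 7, 9]                            # tensor model: one scalar per head

lora_update = lora_head_update(b_heads, a, 1)
tensor_update = tensor_head_update(b, c, a, 1)
print(count_lora(b_heads, a), count_tensor(b, c, a))
```

In this toy setting both updates are rank 1 matrices of the same shape for every head, but the tensor model replaces the per-head vectors $\bm b_h$ with one shared vector and one scalar per head, so its parameter count is smaller.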
