LoRTA: Low Rank Tensor Adaptation of Large Language Models
Ignacio Hounie, Charilaos Kanatsoulis, Arnuv Tandon, Alejandro Ribeiro
TL;DR
LoRTA tackles the high parameter cost of fine-tuning large language models by introducing a 5-way CANDECOMP/PARAFAC (CP) tensor parameterization that unifies updates across layers, attention heads, and Q/K/V/P matrices. By representing all weight updates as a single high-order tensor and learning all factor matrices jointly, LoRTA achieves substantial parameter reductions without compromising performance on tasks including NLU, instruction tuning, preference optimization, and protein folding. Empirical results show LoRTA matching or surpassing state-of-the-art tensor-based PEFT methods with far fewer trainable parameters (often one to two orders of magnitude fewer) and favorable I/O characteristics when serving concurrent adapters. The work demonstrates the practical potential of high-order tensor adapters for scalable, multi-task LLM fine-tuning and points to future directions such as MoE integration and more efficient tensor operations.
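As a rough illustration of the parameterization described above, the following NumPy sketch builds every weight update from five shared factor matrices. The mode ordering (input dimension, per-head dimension, heads, layers, matrix type), the function name `cp_weight_updates`, and all sizes are assumptions made for this example, not details taken from the authors' implementation.

```python
import numpy as np

def cp_weight_updates(A, B, C_heads, C_layers, C_mats):
    """Rebuild all weight updates from rank-r CP factors (illustrative sketch).

    A: (d_in, r), B: (d_head, r), C_heads: (H, r),
    C_layers: (L, r), C_mats: (M, r).
    Returns an array of shape (L, M, d_in, H * d_head): one update matrix
    per (layer, matrix type) pair.
    """
    # Sum over the shared rank index r of the outer product of the five factors.
    T = np.einsum("ir,jr,hr,lr,mr->lmihj", A, B, C_heads, C_layers, C_mats)
    L, M, d_in, H, d_head = T.shape
    # Merge (head, per-head dim) back into a single output dimension.
    return T.reshape(L, M, d_in, H * d_head)

# Hypothetical toy sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
d_in, d_head, H, L, M, r = 16, 4, 4, 3, 4, 2
factors = [rng.standard_normal((n, r)) for n in (d_in, d_head, H, L, M)]

delta = cp_weight_updates(*factors)
n_trainable = sum(f.size for f in factors)
print(delta.shape, n_trainable)
```

In this sketch every one of the `L * M` update matrices has rank at most `r`, yet the only trainable parameters are the five factor matrices, which is where the sharing across layers, heads, and matrix types comes from.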
Abstract
Low Rank Adaptation (LoRA) is a popular Parameter-Efficient Fine-Tuning (PEFT) method that effectively adapts large pre-trained models to downstream tasks. LoRA parameterizes model updates using low-rank matrices at each layer, significantly reducing the number of trainable parameters and, consequently, the resource requirements of fine-tuning. However, the lower bound on the number of trainable parameters remains high because of the low-rank matrix model. Recent works have addressed this limitation by proposing low-rank tensor parameterizations for model updates, but they either exploit redundancy only across layers or tensorize individual matrices using ad hoc schemes that introduce additional hyperparameters. In this work, we propose a higher-order CANDECOMP/PARAFAC (CP) decomposition, enabling a more compact and flexible representation than existing matrix- and tensor-based PEFT methods. Our experiments on Natural Language Understanding, Instruction Tuning, Preference Optimization, and Protein Folding benchmarks demonstrate that our method can reduce the number of trainable parameters while maintaining comparable performance.
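The lower bound mentioned above can be made concrete with a small count. The sketch below compares the smallest (rank-1) trainable parameter budget of per-matrix LoRA with that of a single 5-way CP model. The model sizes are hypothetical and chosen for illustration only, the helper names are invented for this example, and the assumed tensor modes (hidden size, per-head size, heads, layers, matrix type) are an assumption rather than a statement of the paper's exact configuration.

```python
def lora_params(d, n_layers, n_mats, r):
    # Each adapted d x d matrix gets its own pair of (d, r) factors.
    return n_layers * n_mats * 2 * d * r

def cp5_params(d, n_heads, n_layers, n_mats, r):
    # One (size, r) factor per tensor mode, shared by all adapted matrices.
    d_head = d // n_heads
    return r * (d + d_head + n_heads + n_layers + n_mats)

# Hypothetical transformer sizes (not taken from the paper).
d, H, L, M = 4096, 32, 32, 4

lora_min = lora_params(d, L, M, r=1)   # 32 * 4 * 2 * 4096 = 1,048,576
cp_min = cp5_params(d, H, L, M, r=1)   # 4096 + 128 + 32 + 32 + 4 = 4,292
print(lora_min, cp_min)
```

Under these assumptions the LoRA count grows with the product of depth, number of adapted matrices, and hidden size, while the CP count grows with their sum, so even at rank 1 the matrix model cannot go below roughly a million parameters here.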
