DoTA: Weight-Decomposed Tensor Adaptation for Large Language Models

Xiaolin Hu; Xiang Cheng; Peiyu Liu; Wei Liu; Jian Luan; Bin Wang; Yong Liu

TL;DR

DoTA addresses the high cost of fine-tuning LLMs with weight-decomposed tensor updates built on a Matrix Product Operator (MPO) decomposition of the pre-trained weights. Unlike prior tensor adaptation methods that initialize randomly, DoTA initializes its trainable tensors from the MPO decomposition and keeps a frozen residual, $W_{\text{res}} = W_0 - \text{MPO}(W_0)$, to preserve base-model knowledge. QDoTA extends this approach to 4-bit NF4 quantization, drastically reducing memory usage while keeping performance close to full-precision DoTA. Empirical results on commonsense and arithmetic reasoning with LLaMA models show that DoTA and QDoTA outperform random-initialization baselines, surpass many PEFT methods with orders of magnitude fewer trainable parameters, and approach full fine-tuning on several tasks, supporting the practicality and robustness of weight-decomposition-based fine-tuning.
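
Read together with the residual definition above, the adapted weight can be written out as follows. This is a sketch inferred from the summary, not a formula quoted from the paper: $\mathcal{T}^{(k)}_0$ (the tensors at initialization) and $N$ (the number of tensors) are notation assumed here, and the product of tensors stands for the MPO contraction back into a matrix.

```latex
\begin{aligned}
\text{MPO}(W_0) &= \mathcal{T}^{(1)}_0 \, \mathcal{T}^{(2)}_0 \cdots \mathcal{T}^{(N)}_0, \\
W_{\text{res}} &= W_0 - \text{MPO}(W_0) \quad \text{(frozen)}, \\
W &= W_{\text{res}} + \mathcal{T}^{(1)} \, \mathcal{T}^{(2)} \cdots \mathcal{T}^{(N)} \quad \text{(only the } \mathcal{T}^{(k)} \text{ are trained)}.
\end{aligned}
```

Under this reading, setting $\mathcal{T}^{(k)} = \mathcal{T}^{(k)}_0$ gives $W = W_0$, so fine-tuning starts exactly from the pre-trained model.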

Abstract

Low-rank adaptation (LoRA) reduces the computational and memory demands of fine-tuning large language models (LLMs) by approximating updates with low-rank matrices. However, low-rank approximation in two-dimensional space fails to capture high-dimensional structures within the target matrix. Recently, tensor decomposition methods have been explored for fine-tuning LLMs, leveraging their ability to extract structured information. Yet, these approaches primarily rely on random initialization, and the impact of initialization on tensor adaptation remains underexplored. In this paper, we reveal that random initialization yields a validation loss that diverges significantly from the one achieved by full fine-tuning. To address this, we propose Weight-Decomposed Tensor Adaptation (DoTA), which leverages the Matrix Product Operator (MPO) decomposition of pre-trained weights for effective initialization in fine-tuning LLMs. Additionally, we introduce QDoTA, a quantized version of DoTA designed for 4-bit quantization. Experiments on commonsense and arithmetic reasoning tasks show that DoTA outperforms randomly initialized methods with fewer parameters. QDoTA further reduces memory consumption and achieves performance comparable to DoTA on commonsense reasoning tasks. We will release our code to support future research.
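
To make the MPO-based initialization concrete, here is a minimal NumPy sketch of one common way to compute an MPO decomposition: reshape the matrix into a higher-order tensor and apply truncated SVDs one site at a time. It is an illustration under stated assumptions, not the authors' released code. The function names `mpo_decompose` and `mpo_reconstruct`, the factorization `(4, 4, 4)`, and the rank cap `max_rank=8` are all hypothetical choices made for the example.

```python
import numpy as np


def mpo_decompose(W, in_dims, out_dims, max_rank):
    """Split W (prod(in_dims) x prod(out_dims)) into a chain of 4-way cores.

    Core k has shape (r_{k-1}, in_dims[k], out_dims[k], r_k), with r_0 = r_N = 1.
    Hypothetical helper: a sequential truncated-SVD scheme, assumed for illustration.
    """
    n = len(in_dims)
    assert len(out_dims) == n
    assert W.shape == (int(np.prod(in_dims)), int(np.prod(out_dims)))

    # Reshape to (i_1..i_n, j_1..j_n), then interleave to (i_1, j_1, ..., i_n, j_n).
    T = W.reshape(*in_dims, *out_dims)
    T = T.transpose([ax for k in range(n) for ax in (k, n + k)])

    cores, r_prev = [], 1
    M = T.reshape(r_prev * in_dims[0] * out_dims[0], -1)
    for k in range(n - 1):
        U, S, Vt = np.linalg.svd(M, full_matrices=False)
        r = min(max_rank, len(S))  # truncate the bond dimension
        cores.append(U[:, :r].reshape(r_prev, in_dims[k], out_dims[k], r))
        # Carry the remaining factor forward and expose the next (i, j) pair.
        M = (S[:r, None] * Vt[:r]).reshape(r * in_dims[k + 1] * out_dims[k + 1], -1)
        r_prev = r
    cores.append(M.reshape(r_prev, in_dims[-1], out_dims[-1], 1))
    return cores


def mpo_reconstruct(cores):
    """Contract the cores back into a (prod(in_dims) x prod(out_dims)) matrix."""
    n = len(cores)
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([out.ndim - 1], [0]))
    out = out[0, ..., 0]  # drop the two boundary bonds of size 1
    # Undo the interleaving: (i_1, j_1, ..., i_n, j_n) -> (i_1..i_n, j_1..j_n).
    out = out.transpose(list(range(0, 2 * n, 2)) + list(range(1, 2 * n, 2)))
    rows = int(np.prod([c.shape[1] for c in cores]))
    cols = int(np.prod([c.shape[2] for c in cores]))
    return out.reshape(rows, cols)


# Toy stand-in for a pre-trained weight matrix (random data, not a real model).
rng = np.random.default_rng(0)
W0 = rng.standard_normal((64, 64))

cores = mpo_decompose(W0, in_dims=(4, 4, 4), out_dims=(4, 4, 4), max_rank=8)
W_tilde = mpo_reconstruct(cores)   # low-rank MPO approximation of W0
W_res = W0 - W_tilde               # residual that would stay frozen
n_trainable = sum(c.size for c in cores)

print([c.shape for c in cores], n_trainable, W0.size)
```

In this sketch the cores would be the trainable parameters and `W_res` would be kept fixed, so `W_res + mpo_reconstruct(cores)` equals `W0` before any update. A random matrix has no special structure, so the truncation error here is large; the abstract's premise is that pre-trained weights carry structure that such a decomposition can capture.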

Paper Structure (18 sections, 7 equations, 7 figures, 2 tables, 1 algorithm)

Introduction
Preliminaries
Tensor and Tensor Operations
Matrix Product Operator
The Proposed Method
MPO-based High-Dimensional Structure Approximation
Weight-Decomposed Tensor Adaptation
QDoTA: DoTA with Quantization
Experiments
Commonsense Reasoning
Mathematical Reasoning
Quantitative Analysis
Ablation Study
Analyzing the Impact of Rank
Related Work
...and 3 more sections

Figures (7)

Figure 1: Comparison of the number of trainable parameters and performance across different methods on commonsense reasoning tasks using the LLaMA3-8B model.
Figure 2: Impact of different initialization methods on evaluation loss. "DoTA-Random" indicates we randomly initialized tensors with the same shape as DoTA.
Figure 3: The architecture of the proposed method. DoTA decomposes the pre-trained weight matrix $\mathbf{W}_0$ into trainable tensors $\{\mathbf{\mathcal{T}}^{(k)}\}_{k=1}^N$ using MPO decomposition. The sequential product of $\{\mathbf{\mathcal{T}}^{(k)}\}_{k=1}^N$ reconstructs the matrix $\tilde{\mathbf{W}}$. A residual matrix $\mathbf{W}_{\text{res}}$ is formed by subtracting the reconstructed matrix $\tilde{\mathbf{W}}$ from the original $\mathbf{W}_0$.
Figure 4: Performance of different methods on mathematical reasoning tasks using the LLaMA2-7B model.
Figure 5: Comparison of the quantized versions of various methods on eight commonsense reasoning tasks using the LLaMA3-8B model. QPiSSA and QLoRA use 0.7% of the parameters required for full fine-tuning, while DoTA uses 0.2%.
...and 2 more figures
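
Several captions above compare methods by their share of trainable parameters. The short sketch below shows how such counts arise for a single square layer, using the standard LoRA count and the sum of MPO core sizes. The layer size (4096 x 4096), the LoRA rank, the MPO factorization, and the bond ranks are illustrative assumptions, not the configurations used in the paper, and the helper names `lora_params` and `mpo_params` are hypothetical. The result is not meant to reproduce the percentages quoted in Figure 5.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters of a LoRA update B @ A with A: rank x d_in, B: d_out x rank."""
    return rank * (d_in + d_out)


def mpo_params(in_dims, out_dims, ranks):
    """Trainable parameters of MPO cores of shape (r_{k-1}, i_k, j_k, r_k).

    `ranks` lists the internal bond dimensions; the two boundary bonds are 1.
    """
    assert len(in_dims) == len(out_dims) == len(ranks) + 1
    bonds = [1, *ranks, 1]
    return sum(
        bonds[k] * i * j * bonds[k + 1]
        for k, (i, j) in enumerate(zip(in_dims, out_dims))
    )


# Assumed example layer and settings (not taken from the paper).
full = 4096 * 4096
lora = lora_params(4096, 4096, rank=16)
mpo = mpo_params((8, 8, 8, 8), (8, 8, 8, 8), ranks=(16, 16, 16))

print(f"full: {full}")
print(f"LoRA: {lora} ({lora / full:.3%} of full)")
print(f"MPO cores: {mpo} ({mpo / full:.3%} of full)")
```

The point of the comparison is structural: the LoRA count grows with the full row and column dimensions, while each MPO core only sees one small factor pair $(i_k, j_k)$, which is why a tensor parameterization can use fewer trainable parameters at a similar bond rank.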
