MetaTT: A Global Tensor-Train Adapter for Parameter-Efficient Fine-Tuning

Javier Lopez-Piqueres, Pranav Deshpande, Archan Ray, Mattia J. Villani, Marco Pistoia, Niraj Kumar

TL;DR

MetaTT introduces a global transformer adapter built from Tensor Train (TT) decompositions to perform parameter-efficient fine-tuning across layers, matrix types, heads, and tasks. By factorizing all linear sub-modules into a single TT, MetaTT achieves substantial parameter compression: the parameter count scales with the sum of the TT modes rather than their product. It extends to multi-task learning by adding a dedicated task-mode core. The approach includes a DMRG-inspired rank-adaptive optimizer that progressively reduces TT ranks during training, improving optimization and generalization. Empirical results across single-task and multi-task benchmarks show that MetaTT performs competitively with, or close to, LoRA while reducing trainable parameters by factors of 2–30×, and that it benefits from rank-adaptive training, particularly for larger models. These findings suggest that TT-based global adapters offer scalable, efficient fine-tuning for large language models in resource-constrained settings, with potential for extension to broader tensor-network architectures and training regimes.
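To make the "sum rather than product" scaling concrete, the following is a minimal counting sketch, not the authors' code. It assumes a uniform TT rank `rank`, boundary cores of shape (d_in × rank) and (rank × d_out), and interior cores of shape (rank × L × rank) and (rank × M × rank) for the layer and matrix-type modes. The function names and the example sizes (hidden size 768, 12 layers, 2 adapted matrices per layer, rank 8) are illustrative assumptions, and LoRA is counted with the usual rank × (d_in + d_out) parameters per adapted matrix.

```python
def lora_param_count(d_in, d_out, num_layers, num_matrices, rank):
    """One independent low-rank pair per adapted matrix: count grows with L * M."""
    return num_layers * num_matrices * rank * (d_in + d_out)


def tt4d_param_count(d_in, d_out, num_layers, num_matrices, rank):
    """One shared order-4 TT (assumed core shapes): count grows with L + M."""
    boundary = rank * (d_in + d_out)                       # input and output cores
    interior = rank * rank * (num_layers + num_matrices)   # layer and matrix-type cores
    return boundary + interior


if __name__ == "__main__":
    # Illustrative sizes only; they are not taken from the paper's tables.
    sizes = dict(d_in=768, d_out=768, num_layers=12, num_matrices=2, rank=8)
    print(lora_param_count(**sizes))   # 294912
    print(tt4d_param_count(**sizes))   # 13184
```

Under these assumptions the layer and matrix-type modes contribute only `rank**2 * (L + M)` parameters to the shared TT, whereas per-matrix adapters multiply the full per-matrix cost by `L * M`.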

Abstract

We present MetaTT, a Tensor Train (TT) adapter framework for fine-tuning pre-trained transformers. MetaTT enables flexible and parameter-efficient model adaptation by using a single shared TT to factorize transformer sub-modules. This factorization indexes key structural dimensions, including layer and matrix type, and can optionally incorporate heads and tasks. This design allows MetaTT's parameter count to scale with the sum, rather than the product, of the modes, resulting in a substantially more compact adapter. Our benchmarks compare MetaTT with LoRA and with recent state-of-the-art fine-tuning methods based on matrix and tensor decompositions. On standard single-task language modeling benchmarks, MetaTT achieves a competitive tradeoff between parameter efficiency and accuracy. We further demonstrate that MetaTT performs competitively with state-of-the-art methods on multi-task learning. Finally, we leverage the TT ansatz to design a rank-adaptive optimizer inspired by the DMRG method from many-body physics. Our results demonstrate that integrating this approach with AdamW enhances optimization performance for a specified target rank.
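The rank-adaptive idea can be illustrated with the standard two-site truncation step used in DMRG-style sweeps: merge two neighboring TT cores, take an SVD, keep the leading singular values, and split back into two cores with a smaller shared rank. The sketch below shows only that generic step with NumPy. It is an assumption about what a single rank reduction could look like, not the paper's algorithm, and `truncate_bond` is a hypothetical helper name.

```python
import numpy as np


def truncate_bond(core_a, core_b, new_rank):
    """Reduce the rank shared by two adjacent TT cores via a truncated SVD.

    core_a has shape (r0, n1, r1) and core_b has shape (r1, n2, r2).
    Returns cores of shape (r0, n1, k) and (k, n2, r2) with k <= new_rank.
    """
    r0, n1, r1 = core_a.shape
    r1_b, n2, r2 = core_b.shape
    assert r1 == r1_b, "cores must share the bond being truncated"

    # Merge the two cores over the shared bond: shape (r0, n1, n2, r2).
    theta = np.tensordot(core_a, core_b, axes=([2], [0]))
    mat = theta.reshape(r0 * n1, n2 * r2)

    # Keep the k largest singular values (best rank-k approximation of mat).
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    k = min(new_rank, s.size)

    new_a = u[:, :k].reshape(r0, n1, k)
    new_b = (s[:k, None] * vt[:k]).reshape(k, n2, r2)  # absorb singular values
    return new_a, new_b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=(3, 5, 10))
    b = rng.normal(size=(10, 6, 3))
    small_a, small_b = truncate_bond(a, b, new_rank=4)
    print(small_a.shape, small_b.shape)
```

In a full sweep this step would be applied bond by bond along the TT, interleaved with gradient updates. How the truncation is scheduled and combined with AdamW is specified by the paper's algorithm, which this sketch does not reproduce.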

Paper Structure

This paper contains 55 sections, 6 equations, 5 figures, 13 tables, 5 algorithms.

Figures (5)

  • Figure 1: Comparison between LoRA and MetaTT adapters. While LoRA parameterizes each weight matrix individually, MetaTT parameterizes all linear maps in the transformer architecture jointly as a TT (shown here only for an MHSA block). We propose two architectures for single-task fine-tuning: a) MetaTT-4D decomposes the entire set of linear maps into a TT of order $4$ along the input/output dimensions (as in LoRA) as well as along the layer dimension, $L$, and the set of projection matrices, $M$. b) MetaTT-5D further decomposes the output dimension along the head dimension and number of heads. To capture task dependencies in multi-task learning, we extend MetaTT by adding a tensor core with mode dimension $T$ corresponding to the number of tasks, resulting in c) MetaTT-(4+1)D. Unlike the fixed rank in LoRA, the TT ranks in MetaTT can adapt during fine-tuning, providing both parameter efficiency and optimization flexibility.
  • Figure 2: Comparison of AdamW and AdamW+DMRG sweeps applied at certain epochs. Results are shown for MetaTT-5D on MRPC and RTE for RoBERTa$_\text{base}$ and RoBERTa$_\text{large}$. For AdamW we fix the rank throughout. For AdamW+DMRG we start with an $r=10$ TT and progressively decrease the ranks until we reach $r=4$, as indicated by arrows on the plots for the base model; the large model follows the same schedule. Error bars in both panels correspond to standard errors. The learning rate used across all optimizers is $5 \times 10^{-4}$ with $0$ weight decay.
  • Figure 3: Influence of the task-dependent TT core in MTL. (Left, top): accuracy of MetaTT-(4+1)D as a function of epochs for RoBERTa$_{\text{Base}}$ for a single training realization (for CoLA we report the Matthews correlation coefficient instead). (Left, bottom): corresponding normalized gradients across all tensors as a function of epochs (see the appendix on MTL). Task labels correspond to $0$: MRPC, $1$: RTE, $2$: CoLA. (Right): same as left, but with RoBERTa$_{\text{Large}}$ as the pretrained model.
  • Figure 4: Influence of the task-dependent TT core in MTL. (Left, top): accuracy of MetaTT-(4+1)D as a function of epochs for RoBERTa$_{\text{Base}}$ for a single training realization (for CoLA we report the Matthews correlation coefficient instead). (Left, bottom): corresponding normalized gradients across all tensors as a function of epochs (see the appendix on MTL). Task labels correspond to $0$: MRPC, $1$: QNLI, $2$: RTE, $3$: CoLA. (Right): same as left, but with RoBERTa$_{\text{Large}}$ as the pretrained model.
  • Figure 5: TT initialization performance. Shown are the accuracies on MRPC (left) and RTE (right) when training MetaTT-4D on RoBERTa$_\text{base}$ with different initialization strategies. The legend reports the mean of the best accuracies over $20$ epochs across $3$ trials. Each pair of letters corresponds to an initialization strategy for one core: 'ze' sets the core to zero, 'id' sets each matrix slice of the core to the identity matrix, and 'no' draws the core from a normal distribution with $\text{mean} = 0$ and $\text{standard deviation} = 0.2$. The order of the letter pairs follows the order in which the cores appear in MetaTT-4D. We choose the sequence ze-id-id-id (blue line) since it generally performs well on average across multiple datasets.
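
The MetaTT-4D structure in Figure 1 and the ze-id-id-id initialization in Figure 5 can be sketched together in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the core ordering (input, layer, matrix type, output), the uniform rank, the use of a rectangular identity for the output core, and the names `init_cores` and `delta_w` are all assumptions.

```python
import numpy as np


def init_cores(d_in, d_out, num_layers, num_matrices, rank):
    """ze-id-id-id: first core zero, every matrix slice of the others identity."""
    g_in = np.zeros((d_in, rank))                               # 'ze'
    g_layer = np.tile(np.eye(rank), (num_layers, 1, 1))         # 'id', shape (L, r, r)
    g_matrix = np.tile(np.eye(rank), (num_matrices, 1, 1))      # 'id', shape (M, r, r)
    g_out = np.eye(rank, d_out)                                 # 'id' (rectangular, assumed)
    return g_in, g_layer, g_matrix, g_out


def delta_w(cores, layer, matrix):
    """Weight update for one (layer, matrix type) pair, read off the shared TT."""
    g_in, g_layer, g_matrix, g_out = cores
    return g_in @ g_layer[layer] @ g_matrix[matrix] @ g_out


if __name__ == "__main__":
    cores = init_cores(d_in=16, d_out=16, num_layers=3, num_matrices=2, rank=4)
    update = delta_w(cores, layer=1, matrix=0)
    print(update.shape)                  # (16, 16)
    print(sum(c.size for c in cores))    # 208
```

Because the first core starts at zero, every update is zero at initialization, so under these assumptions the adapted model initially reproduces the pretrained one. All layers and matrix types share the same input and output cores and differ only in which small (rank × rank) slices are selected.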