Transformer-Squared: Self-adaptive LLMs

Qi Sun; Edoardo Cetin; Yujin Tang


TL;DR

The paper tackles the static nature and high cost of traditional fine-tuning by proposing Transformer^2, a self-adaptive LLM framework that builds a bank of domain-specific expert vectors through Singular Value Fine-tuning (SVF). For each weight matrix with decomposition $W = U \Sigma V^\top$, SVF learns a vector $z$ that rescales the singular values, giving $W' = U \Sigma' V^\top$ with $\Sigma' = \Sigma \otimes \text{diag}(z)$. The resulting adaptations are compact and composable; they are trained with RL and regularized by a KL penalty. At inference time, Transformer^2 uses a two-pass process and three adaptation strategies to compose experts for unseen prompts. It outperforms LoRA with far fewer parameters and shows cross-model transfer and vision-language versatility. The work reports strong empirical results across diverse LLMs and tasks and proposes a scalable path toward dynamic, self-organizing AI systems, with practical implications for deployment efficiency and continual learning.
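The SVF update summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the matrix size, the values in `z_expert`, and the LoRA rank are hypothetical, and the helper name `svf_adapt` is introduced here only for illustration. It shows that scaling the singular values by $z$ costs one parameter per singular value, compared with $r(m+n)$ parameters for a rank-$r$ LoRA adapter on an $m \times n$ matrix.

```python
import numpy as np

def svf_adapt(W, z):
    """Return W' = U diag(sigma * z) V^T for a scaling vector z."""
    U, sigma, Vt = np.linalg.svd(W, full_matrices=False)
    # Multiplying the columns of U by (sigma * z) equals U @ diag(sigma * z).
    return (U * (sigma * z)) @ Vt

rng = np.random.default_rng(0)
m, n = 6, 4                      # toy sizes; real LLM matrices are far larger
W = rng.standard_normal((m, n))

# With z = 1 everywhere, the adapted matrix equals the original.
W_same = svf_adapt(W, np.ones(min(m, n)))

# Hypothetical expert vector: amplify one component, damp or drop others.
z_expert = np.array([1.2, 1.0, 0.5, 0.0])
W_adapted = svf_adapt(W, z_expert)

# Trainable parameters for this single matrix.
svf_params = min(m, n)           # one scale per singular value
lora_rank = 2                    # assumed rank for the comparison
lora_params = lora_rank * (m + n)
print(svf_params, lora_params)
```

Because $U$ and $V$ stay fixed, an expert is just the vector $z$, which is what makes experts cheap to store and easy to combine.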

Abstract

Self-adaptive large language models (LLMs) aim to solve the challenges posed by traditional fine-tuning methods, which are often computationally intensive and static in their ability to handle diverse tasks. We introduce Transformer-Squared, a novel self-adaptation framework that adapts LLMs for unseen tasks in real-time by selectively adjusting only the singular components of their weight matrices. During inference, Transformer-Squared employs a two-pass mechanism: first, a dispatch system identifies the task properties, and then task-specific 'expert' vectors, trained using reinforcement learning, are dynamically mixed to obtain targeted behavior for the incoming prompt. Our method consistently outperforms ubiquitous approaches such as LoRA, with fewer parameters and greater efficiency. Furthermore, Transformer-Squared demonstrates versatility across different LLM architectures and modalities, including vision-language tasks. Transformer-Squared represents a significant leap forward, offering a scalable, efficient solution for enhancing the adaptability and task-specific performance of LLMs, paving the way for truly dynamic, self-organizing AI systems.
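The two-pass mechanism described in the abstract can be sketched as follows. This is a toy illustration under loud assumptions: the expert values are made up, the keyword-matching `dispatch` function is a hypothetical stand-in for the paper's dispatch system, and mixing experts as a weighted sum of their vectors is one simple way to read "dynamically mixed", not a confirmed detail of the method.

```python
import numpy as np

# Hypothetical expert bank: one scaling vector per domain (values are made up).
EXPERTS = {
    "math": np.array([1.3, 1.0, 0.8, 0.6]),
    "code": np.array([0.9, 1.2, 1.1, 0.7]),
    "reasoning": np.array([1.0, 0.9, 1.2, 1.0]),
}

# Hypothetical keyword lists standing in for the first-pass task dispatch.
KEYWORDS = {
    "math": ("solve", "integral", "equation"),
    "code": ("python", "function", "bug"),
    "reasoning": ("why", "infer", "deduce"),
}

def dispatch(prompt):
    """First pass: return mixing weights over the experts (they sum to 1)."""
    text = prompt.lower()
    counts = np.array(
        [sum(k in text for k in KEYWORDS[name]) for name in EXPERTS],
        dtype=float,
    )
    if counts.sum() == 0:
        # No keyword matched: fall back to a uniform mixture.
        return np.full(len(EXPERTS), 1.0 / len(EXPERTS))
    return counts / counts.sum()

def mix_experts(weights):
    """Combine expert vectors as a weighted sum (an assumed mixing rule)."""
    return sum(w * z for w, z in zip(weights, EXPERTS.values()))

def adapt(W, z):
    """Rescale the singular values of W by z and rebuild the matrix."""
    U, sigma, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * (sigma * z)) @ Vt

prompt = "Write a Python function to solve this equation"
alpha = dispatch(prompt)          # pass 1: identify task properties
z_mixed = mix_experts(alpha)      # build the prompt-specific vector
W = np.random.default_rng(1).standard_normal((5, 4))
W_adapted = adapt(W, z_mixed)     # pass 2 would generate with these weights
```

For this prompt the toy dispatcher matches two math keywords and two coding keywords, so the mixed vector is the average of the math and code experts.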

Paper Structure (22 sections, 2 equations, 13 figures, 10 tables)


Introduction
Related works
Methods
Preliminaries
$\text{Transformer}^2$
Experiments
Experimental setups
Experimental results
Analysis
Conclusion
Implementation details and hyper-parameters
SVF training
LoRA training
Hyper parameters
Few-shot adaptation
...and 7 more sections

Figures (13)

Figure 1: Overview of $\text{Transformer}^2$. In the training phase, we tune the scales of the singular values of the weight matrices to generate a set of "expert" vectors, each of which specializes in one type of task. In the inference phase, a two-pass process is adopted where the first pass applies the task-specific expert and the second generates the answer.
Table 1: Fine-tuning results. LLM performance on the test splits of math, coding and reasoning. Normalized scores are in parentheses.
Figure 2: Method overview. Left) At training time, we employ SVF and RL to learn the "expert" vectors $z$'s that scale the singular values of the weight matrices. Right) At inference time, we propose three distinct methods to adaptively select/combine the learned expert vectors.
Figure 3: Prompt based adaptation. Self-adaptation prompt used by $\text{Transformer}^2$ to classify the task prompt into pre-defined categories.
Figure 4: SVF learning curves. The dashed lines indicate the performance of Llama3-8B-Instruct on the test split of each task. SVF effectively fine-tunes to surpass the base performance. While we use the best validation score to select our checkpoint for evaluation (marked by red dots), we present the entire training curve without early stopping to demonstrate SVF's learning capabilities. Tasks with only hundreds of training samples like Coding and Reasoning were stopped early. In our experiments, we update the parameters at the end of each epoch.
...and 8 more figures
