Vertical LoRA: Dense Expectation-Maximization Interpretation of Transformers

Zhuolin Fu

TL;DR

This paper addresses the challenge of deploying large Transformer models by reframing Transformers as dense Expectation-Maximization algorithms that maximize the posterior $P(y|x;\theta)$. It introduces Vertical LoRA (VLoRA), a base-plus-increments architecture where each layer learns a low-rank increment based on the previous layer, enabling substantial parameter reductions while preserving performance. The approach is theoretically grounded in an EM interpretation and operationalized by partitioning layers into chunks (VLoRA Compounds) with hierarchical increments. Empirically, VLoRA on Vision Transformer setups, exemplified by CIFAR-10 experiments, achieves major parameter savings with comparable accuracy to fully trained baselines, indicating strong practical impact for efficient model design.

Abstract

In this paper, we show how Transformers can be interpreted as dense Expectation-Maximization algorithms performed on Bayesian Nets. Based on this interpretation, we propose a new model design paradigm, Vertical LoRA (VLoRA), which reduces the parameter count dramatically while preserving performance. In VLoRA, a model consists of layers, each of which recursively learns an increment based on the previous layer. We then apply LoRA decomposition to the increments. VLoRA operates on the base model and is orthogonal to LoRA, so the two can be used together. We run experiments on various tasks and models. The results show that 1) with VLoRA, the parameter count of a Transformer model can be reduced dramatically and 2) the performance of the original model is preserved. The source code is available at \url{https://github.com/neverUseThisName/vlora}.
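The recursive increment idea can be illustrated with a minimal NumPy sketch. It assumes one reading of the abstract: each layer's weight is the previous layer's weight plus a rank-$r$ product, $W_l = W_{l-1} + A_l B_l$. The function name `vlora_weights`, the sizes `d`, `r`, `n_layers`, and the choice of a single square weight matrix per layer are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np

def vlora_weights(base, increments):
    """Build per-layer weights where W_l = W_{l-1} + A_l @ B_l (assumed form)."""
    weights = [base]
    for a, b in increments:
        # Each layer reuses the previous layer's weight and adds a low-rank increment.
        weights.append(weights[-1] + a @ b)
    return weights

rng = np.random.default_rng(0)
d, r, n_layers = 64, 4, 6  # hypothetical sizes, chosen only for illustration

base = rng.standard_normal((d, d))
increments = [
    (0.01 * rng.standard_normal((d, r)), 0.01 * rng.standard_normal((r, d)))
    for _ in range(n_layers - 1)
]
weights = vlora_weights(base, increments)

# Parameters stored: one dense base matrix plus two thin factors per later layer.
dense_params = n_layers * d * d
vlora_params = d * d + (n_layers - 1) * 2 * d * r
```

With these illustrative sizes, the dense stack stores `n_layers * d * d` values, while the sketch stores only the base matrix and the thin factors, so the saving grows as `r` shrinks relative to `d`.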

Paper Structure (11 sections, 11 equations, 2 figures, 1 table)

Introduction
Related Work
Low-rank Decomposition
LoRA
LORS
Method
Transformers as EM Algorithms
VLoRA
Experiments
Image Classification
Conclusion

Figures (2)

Figure 1: VLoRA model architecture. We partition an $L$-layer Transformer into $k$ chunks. Each chunk contains $L/k$ layers; the first layer in a chunk is a base layer, and the remaining layers are VLoRA layers.
Figure 2: Training and evaluation loss and accuracy curves of ViT and its VLoRA versions on CIFAR-10.
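
The chunk layout described in the Figure 1 caption can be sketched in a few lines of Python. This is an illustration only: the helper name `vlora_layout` and the example values 12 and 3 are hypothetical, and the sketch assumes that $L$ is divisible by $k$, as the caption's $L/k$ suggests.

```python
def vlora_layout(num_layers, num_chunks):
    """Label each layer as 'base' (first in its chunk) or 'vlora' (the rest)."""
    if num_layers % num_chunks != 0:
        # Assumption: chunks are equally sized, so L must be a multiple of k.
        raise ValueError("num_layers must be divisible by num_chunks")
    chunk_size = num_layers // num_chunks
    return ["base" if i % chunk_size == 0 else "vlora" for i in range(num_layers)]

# Hypothetical example: a 12-layer model split into 3 chunks of 4 layers.
layout = vlora_layout(12, 3)
```

Under this layout, only the `k` base layers carry dense weights, and every other layer would store just its low-rank increment.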
