Vertical LoRA: Dense Expectation-Maximization Interpretation of Transformers
Zhuolin Fu
TL;DR
This paper addresses the challenge of deploying large Transformer models by reframing Transformers as dense Expectation-Maximization (EM) algorithms that maximize the posterior $P(y|x;\theta)$. It introduces Vertical LoRA (VLoRA), a base-plus-increments architecture in which each layer learns a low-rank increment relative to the previous layer, reducing the parameter count substantially while preserving performance. The approach is grounded in the EM interpretation and is implemented by partitioning layers into chunks (VLoRA Compounds) with hierarchical increments. In experiments on Vision Transformers, exemplified by CIFAR-10, VLoRA uses far fewer parameters while reaching accuracy comparable to fully trained baselines, which makes it a practical option for efficient model design.
Abstract
In this paper, we show how Transformers can be interpreted as dense Expectation-Maximization algorithms performed on Bayesian networks. Based on this interpretation, we propose a new model design paradigm, Vertical LoRA (VLoRA), which reduces the parameter count dramatically while preserving performance. In VLoRA, a model consists of layers, each of which recursively learns an increment relative to the previous layer. We then apply LoRA decomposition to the increments. Because VLoRA operates on the base model, it is orthogonal to LoRA, so the two can be used together. We conduct experiments on various tasks and models. The results show that 1) with VLoRA, the Transformer parameter count can be reduced dramatically and 2) the performance of the original model is preserved. The source code is available at \url{https://github.com/neverUseThisName/vlora}.
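The recursive increment structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the first layer uses a dense base weight directly and that each later layer's weight is the previous layer's weight plus a low-rank product $B_l A_l$. The function names, matrix shapes, and rank are hypothetical, and the chunking into VLoRA Compounds is omitted.

```python
# Illustrative sketch only: names, shapes, and the "first layer = base"
# convention are assumptions, not taken from the paper or its repository.

def full_param_count(num_layers, d_in, d_out):
    """Parameters when every layer stores its own dense d_out x d_in weight."""
    return num_layers * d_in * d_out


def vlora_param_count(num_layers, d_in, d_out, rank):
    """One dense base weight plus one low-rank increment per later layer."""
    base = d_in * d_out
    increment = rank * (d_in + d_out)  # B is d_out x rank, A is rank x d_in
    return base + (num_layers - 1) * increment


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def vlora_weights(base, increments):
    """Return [W_1, W_2, ...] with W_1 = base and W_l = W_{l-1} + B_l @ A_l."""
    weights = [base]
    for b, a in increments:
        delta = matmul(b, a)  # low-rank increment for this layer
        prev = weights[-1]
        weights.append([[p + d for p, d in zip(prev_row, delta_row)]
                        for prev_row, delta_row in zip(prev, delta)])
    return weights


# Tiny example: a 2 x 2 base weight and a single rank-1 increment.
base = [[1, 0], [0, 1]]
increments = [([[1], [0]], [[0, 2]])]  # B is 2 x 1, A is 1 x 2
layers = vlora_weights(base, increments)
```

Under these assumptions, a hypothetical stack of 12 square 768 x 768 weights with rank-8 increments would store 724,992 parameters instead of 7,077,888. This is pure arithmetic for one made-up configuration, not a result reported in the paper.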
