Vision as LoRA

Han Wang, Yongjie Ye, Bingru Li, Yuxiang Nie, Jinghui Lu, Jingqun Tang, Yanjie Wang, Can Huang

TL;DR

VoRA introduces a novel encoder-free MLLM paradigm by embedding vision into an LLM via mergeable LoRA layers, avoiding external vision modules during inference. It strengthens vision understanding with block-wise distillation from a pre-trained ViT and uses bi-directional attention masks for vision tokens, enabling native-resolution vision processing. With a mixed data strategy of image-caption and text-instruction data, VoRA achieves competitive performance on multiple benchmarks relative to encoder-based baselines, while reducing inference overhead. Limitations include dependency on additional pretraining data and weaker world-knowledge performance, suggesting directions for data augmentation and token-compression improvements to broaden practical impact.

Abstract

We introduce Vision as LoRA (VoRA), a novel paradigm for transforming an LLM into an MLLM. Unlike prevalent MLLM architectures that rely on external vision modules for vision encoding, VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM. This design allows the added parameters to be seamlessly merged into the LLM during inference, eliminating structural complexity and minimizing computational overhead. Moreover, inheriting the LLM's ability to handle flexible context, VoRA can process inputs at arbitrary resolutions. To further strengthen VoRA's visual capabilities, we introduce a block-wise distillation method that transfers visual priors from a pre-trained ViT into the LoRA layers, effectively accelerating training by injecting visual knowledge. Additionally, we apply bi-directional attention masks to better capture the context information of an image. We demonstrate that, with additional pre-training data, VoRA can perform comparably with conventional encoder-based MLLMs. All training data, code, and model weights will be released at https://github.com/Hon-Wong/VoRA.
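
The claim that the LoRA layers can be merged into the LLM at inference can be illustrated with a small numerical sketch. It assumes the standard LoRA formulation, in which a frozen weight W is augmented by a low-rank update scaled by a constant, so that W' = W + scale * (B A). The function names, dimensions, and scale value below are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def lora_forward(weight, lora_a, lora_b, scale, x):
    """Unmerged path: base projection plus a separate low-rank branch."""
    base = matvec(weight, x)
    low_rank = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * l for b, l in zip(base, low_rank)]


def merge_lora(weight, lora_a, lora_b, scale):
    """Fold the low-rank update into the base weight: W' = W + scale * (B @ A)."""
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Hypothetical toy sizes; real LLM projections are far larger.
rng = random.Random(0)
d_in, d_out, rank, scale = 6, 4, 2, 0.5


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


weight = rand_matrix(d_out, d_in)   # frozen base weight
lora_a = rand_matrix(rank, d_in)    # LoRA down-projection A
lora_b = rand_matrix(d_out, rank)   # LoRA up-projection B
x = [rng.uniform(-1.0, 1.0) for _ in range(d_in)]

unmerged = lora_forward(weight, lora_a, lora_b, scale, x)
merged = matvec(merge_lora(weight, lora_a, lora_b, scale), x)

# Largest difference between the two paths; floating-point noise only.
max_gap = max(abs(u - m) for u, m in zip(unmerged, merged))
```

After merging, the layer is a single matrix of the original shape, so the forward pass costs the same as the unmodified LLM layer and no extra LoRA parameters need to be stored.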

Paper Structure

This paper contains 17 sections, 4 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: A high-level overview of VoRA. Visual parameters are indicated with an eye icon. Mainstream MLLMs adopt a modular, sequential architecture: raw pixels are first processed by a pre-trained vision encoder to extract high-level visual features, which are then aligned with the LLM through a modality connector for vision-language tasks. In contrast, VoRA consists solely of an LLM and a lightweight embedding layer. The LoRA layers serve as visual parameters that can be integrated into the LLM without incurring additional computational costs or memory burdens.
  • Figure 2: The architecture of VoRA. Figure (a) shows the architecture of VoRA in pre-training: in this stage, VoRA only unfreezes the LoRA layers for vision and the visual embedding layer, i.e., a shallow MLP layer with a positional embedding. Figure (b) shows VoRA in inference: the LoRA layers are merged into the LLM, and thus the only added parameters are a shallow embedding layer (about 6M parameters).
  • Figure 3: Attention masks for vision: (a) causal attention inherits the autoregressive mask from language modeling, enforcing sequential dependency between image patches; (b) bidirectional attention offers full visibility between all image patches within the same input, enabling global contextual awareness.
  • Figure 4: Language modeling losses in different settings. Training the full LLM on data from a new modality can lead to an unrecoverable spike in the loss curve, i.e., loss collapse.
  • Figure 5: Pre-training loss curves under different configurations. Loss values are smoothed (window=100) for visual clarity. The data sampling order was fixed to ensure a fair comparison, as evidenced by the similar trajectories of the loss curves across settings. "LoRA-r1024 | Bidirectional | Block-wise" refers to the setting with LoRA of rank 1024, bi-directional attention masks for vision, and block-wise distillation. The configuration with the lowest loss was adopted as the default setting in our experiments.
  • ...and 2 more figures
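
The two masking schemes described in the Figure 3 caption can be sketched as boolean matrices, where entry [i][j] says whether token i may attend to token j. The sketch assumes a sequence in which image patches are tagged with an image id and text tokens are tagged with None. The function name `build_attention_mask` and this tagging scheme are hypothetical, used only to make the caption concrete, and are not taken from the VoRA code.

```python
def build_attention_mask(image_ids, bidirectional_vision):
    """Return an n x n boolean mask; mask[i][j] is True if token i may attend to token j.

    image_ids[k] is an integer image id for a vision token, or None for a text token.
    """
    n = len(image_ids)
    # Start from the standard causal (lower-triangular) mask used in language modeling.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    if bidirectional_vision:
        # Additionally let patches of the same image see each other in both directions.
        for i in range(n):
            for j in range(i + 1, n):
                if image_ids[i] is not None and image_ids[i] == image_ids[j]:
                    mask[i][j] = True
    return mask


# Hypothetical toy sequence: three patches of one image followed by two text tokens.
image_ids = [0, 0, 0, None, None]

causal = build_attention_mask(image_ids, bidirectional_vision=False)
hybrid = build_attention_mask(image_ids, bidirectional_vision=True)

for name, mask in (("causal", causal), ("bidirectional vision", hybrid)):
    print(name)
    for row in mask:
        print(" ".join("1" if allowed else "0" for allowed in row))
```

In the second mask, every patch of the image can attend to every other patch of the same image, while text tokens keep the usual autoregressive constraint and vision tokens still cannot see later text.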