Visual Perception by Large Language Model's Weights

Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun

TL;DR

This paper addresses the inefficiency of multimodal large language models caused by token-heavy visual inputs in input-space alignment. It proposes parameter-space alignment via VLoRA, where visual features are transformed into low-rank perceptual weights that are merged into the LLM’s weight matrices, eliminating the need for visual tokens during inference. The perceptual weights generator employs a LoRA-like decomposition to produce $\Delta W$ and inject it into $k$ decoder blocks across all five weight types, achieving substantial computational savings while maintaining competitive accuracy on six benchmarks. The approach demonstrates promising efficiency gains and broad applicability, with plans to open-source code and models. The work also provides a detailed cost analysis and thorough ablations to justify design choices.

Abstract

Existing Multimodal Large Language Models (MLLMs) follow the paradigm that perceives visual information by aligning visual features with the input space of Large Language Models (LLMs), and concatenating visual tokens with text tokens to form a unified sequence input for LLMs. These methods demonstrate promising results on various vision-language tasks but are limited by the high computational effort due to the extended input sequence resulting from the involvement of visual tokens. In this paper, instead of input space alignment, we propose a novel parameter space alignment paradigm that represents visual information as model weights. For each input image, we use a vision encoder to extract visual features, convert features into perceptual weights, and merge the perceptual weights with LLM's weights. In this way, the input of LLM does not require visual tokens, which reduces the length of the input sequence and greatly improves efficiency. Following this paradigm, we propose VLoRA with the perceptual weights generator. The perceptual weights generator is designed to convert visual features to perceptual weights with low-rank property, exhibiting a form similar to LoRA. The experimental results show that our VLoRA achieves comparable performance on various benchmarks for MLLMs, while significantly reducing the computational costs for both training and inference. The code and models will be made open-source.
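
The merge step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the hidden size `h`, the rank `r`, and the random factors `A` and `B` are assumptions chosen for illustration, and `A` and `B` merely stand in for the output of the perceptual weights generator. The sketch shows that the merged layer processes text tokens only, and that merging a low-rank $\Delta W$ is equivalent to adding a LoRA-style branch to the frozen weight.

```python
import numpy as np

rng = np.random.default_rng(0)

h, r = 64, 4          # hidden size and rank (illustrative values, not from the paper)
num_text_tokens = 5   # the LLM input holds text tokens only, no visual tokens

# Frozen weight matrix of one linear layer in the LLM (hypothetical stand-in).
W = rng.standard_normal((h, h))

# Stand-in for the perceptual weights generator. In VLoRA these factors would be
# derived from the visual features of the input image; here they are random.
A = rng.standard_normal((h, r))
B = rng.standard_normal((r, h))

delta_W = A @ B           # perceptual weights, rank at most r
W_merged = W + delta_W    # merge the perceptual weights with the LLM's weights

x_text = rng.standard_normal((num_text_tokens, h))  # text-token activations
y = x_text @ W_merged

# Equivalent view: the frozen layer plus a low-rank branch, as in LoRA.
y_branch = x_text @ W + (x_text @ A) @ B

print("rank of delta_W:", np.linalg.matrix_rank(delta_W))
print("merged and branch outputs agree:", np.allclose(y, y_branch))
```

Because the image enters through `W_merged` rather than through the input sequence, the sequence length in this sketch stays at `num_text_tokens` regardless of how many visual features the image produces.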

Paper Structure (19 sections, 6 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Overview of the input space alignment and parameter space alignment paradigms. The input space alignment paradigm aligns visual features with the input space of the LLM and concatenates visual tokens with text tokens as the LLM's input. Our proposed VLoRA follows the parameter space alignment paradigm, which aligns visual features with the parameters of the LLM and merges the perceptual weights produced by the perceptual weights generator with the LLM's weights.
  • Figure 2: Details of the LLM Decoder Block. (a) illustrates the details of the LLM decoder block, including the multi-head self-attention module and the feed-forward network. (b) provides a detailed view of the multi-head self-attention module, which incorporates four types of weights: $W_Q$, $W_K$, $W_V$, and $W_O$. (c) depicts the feed-forward network, which consists of the weights $W_1$ and $W_2$.
  • Figure 3: Perceptual Weights Generator. Figure (a) illustrates the pipeline of our perceptual weights generator. We set $k$ learnable perceptual queries, which interact with image features in $N$ decoder blocks, and obtain $k$ visual parameters. Then, a shared linear layer and $k$ independent linear layers are used to convert these visual parameters to perceptual weights $\Delta W$. Figure (b) demonstrates that our approach is formally consistent with LoRA.
  • Figure 4: Comparison of FLOPs. This figure shows the FLOPs of LLaVA and VLoRA with different numbers of input visual tokens. The left subplot illustrates the change in GFLOPs, the right subplot plots the ratio of GFLOPs for VLoRA to LLaVA, and C denotes the number of text tokens.
  • Figure 5: Visualization results of VLoRA. This figure demonstrates the capabilities of our VLoRA in real-world scenarios, including accurate counting and common sense reasoning.
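
The trend that Figure 4 describes can be approximated with a rough back-of-the-envelope estimate. The sketch below is not the paper's cost analysis: it uses a common approximation of decoder forward FLOPs, and the function name `decoder_gflops`, the model dimensions (`hidden`, `ffn`, `layers`), and the token counts are all illustrative assumptions. It also ignores the vision encoder and the perceptual weights generator, so the ratios it prints are optimistic relative to a full accounting.

```python
def decoder_gflops(seq_len, hidden=4096, ffn=11008, layers=32):
    """Rough forward-pass GFLOPs of a decoder stack (common approximation).

    Dimensions default to assumed 7B-class values, not figures from the paper.
    The feed-forward network is modeled with two matrices (W_1 and W_2).
    """
    proj = 8 * seq_len * hidden**2      # W_Q, W_K, W_V, W_O projections
    attn = 4 * seq_len**2 * hidden      # QK^T scores and attention-weighted sum
    mlp = 4 * seq_len * hidden * ffn    # W_1 and W_2
    return layers * (proj + attn + mlp) / 1e9


C = 32  # number of text tokens (assumed value)
ratios = {}
for V in (64, 144, 256, 576):  # number of visual tokens (assumed values)
    input_space = decoder_gflops(C + V)  # visual tokens concatenated with text
    param_space = decoder_gflops(C)      # text tokens only; image lives in the weights
    ratios[V] = param_space / input_space
    print(f"V={V:4d}  input-space={input_space:9.1f} GFLOPs  "
          f"text-only={param_space:8.1f} GFLOPs  ratio={ratios[V]:.3f}")
```

Under these assumptions the text-only cost is constant in `V`, while the input-space cost grows with every added visual token, so the ratio shrinks as `V` increases.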