SAISA: Towards Multimodal Large Language Models with Both Training and Inference Efficiency

Qianhao Yuan, Yanjiang Liu, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun

TL;DR

The paper addresses the efficiency gap in multimodal large language models by analyzing embedding-space and cross-attention architectures, identifying that attention among visual tokens imposes substantial inference cost and that introducing many parameters hampers training efficiency. It introduces NAAViT (No Attention Among Visual Tokens) and SAISA (Self-Attention Input Space Alignment), where visual features are aligned directly to the self-attention input spaces via a per-layer projector, removing costly visual-token attention and FFN computations on visual tokens. Empirical results show SAISA achieves the best balance of performance and efficiency, delivering up to 66% fewer inference FLOPs and 26% lower training budget while outperforming several baselines on a broad set of benchmarks; ablations confirm robustness across LLMs and visual encoders. The work provides a practical, modular path to scalable, efficient MLLMs, with public code and models to facilitate adoption and further development.

Abstract

Multimodal Large Language Models (MLLMs) mainly fall into two architectures, each involving a trade-off between training and inference efficiency: embedding space alignment (e.g., LLaVA-1.5) is inefficient during inference, while cross-attention space alignment (e.g., Flamingo) is inefficient in training. In this paper, we compare these two architectures and identify the key factors for building efficient MLLMs. A primary difference between them lies in how attention is applied to visual tokens, particularly in their interactions with each other. To investigate whether attention among visual tokens is necessary, we propose a new self-attention mechanism, NAAViT (No Attention Among Visual Tokens), which eliminates this type of attention. Our pilot experiment on LLaVA-1.5 shows that attention among visual tokens is highly redundant. Based on these insights, we introduce SAISA (Self-Attention Input Space Alignment), a novel architecture that enhances both training and inference efficiency. SAISA directly aligns visual features with the input spaces of NAAViT self-attention blocks, reducing computational overhead in both self-attention blocks and feed-forward networks (FFNs). Using the same configuration as LLaVA-1.5, SAISA reduces inference FLOPs by 66% and training budget by 26%, while achieving superior performance in terms of accuracy. Comprehensive ablation studies further validate the effectiveness of SAISA across various LLMs and visual encoders. The code and model will be publicly available at https://github.com/icip-cas/SAISA.
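
To make the NAAViT idea concrete, the following is a minimal sketch of the attention pattern the abstract describes, not the authors' implementation. It assumes a simplified sequence layout in which all visual tokens come first and all text tokens follow, and it assumes the usual causal convention that a text token may also attend to itself. The function names `causal_mask` and `naavit_mask` are hypothetical and chosen only for this illustration.

```python
def causal_mask(num_visual, num_text):
    """Standard causal mask: every token attends to itself and all earlier tokens."""
    n = num_visual + num_text
    return [[k <= q for k in range(n)] for q in range(n)]


def naavit_mask(num_visual, num_text):
    """NAAViT-style mask under this sketch's assumptions.

    mask[q][k] is True when query position q may attend to key position k.
    Assumed layout (not taken from the paper): positions [0, num_visual)
    are visual tokens, the remaining positions are text tokens.
    """
    n = num_visual + num_text
    mask = [[False] * n for _ in range(n)]
    # Only text tokens act as queries; visual rows stay all False,
    # so there is no attention among visual tokens.
    for q in range(num_visual, n):
        # A text query sees every visual token and the text up to itself.
        for k in range(q + 1):
            mask[q][k] = True
    return mask


if __name__ == "__main__":
    for row in naavit_mask(num_visual=2, num_text=3):
        print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the rows for visual tokens are empty because visual tokens issue no queries, while each text row covers all visual columns plus the text columns up to its own position.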

Paper Structure

This paper contains 30 sections, 9 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Top: Performance vs. inference efficiency based on various LLMs and visual encoders, where Average Performance means an average of benchmark scores (MMMU, MMBench, MMBench-CN, POPE, GQA, ScienceQA IMG and OK-VQA) and inference efficiency is the inverse of inference TFLOPs. When trained on the same data and using the same number of visual tokens, SAISA (orange) offers a more favorable balance between inference efficiency and performance than LLaVA-1.5 (gray). Bottom: Training budget comparison between SAISA and LLaVA-1.5, reported in training GPU hours, using Vicuna-7B as the LLM and CLIP-ViT-L/14-336 as the visual encoder. SAISA achieves higher training efficiency.
  • Figure 2: Overview of SAISA and the mainstream architectures for aligning visual features with the language model. (a) Aligning visual features with the embedding space of the language model is inefficient during inference, e.g. the LLaVA series. (b) Aligning visual features with the attention spaces of new cross-attention blocks is inefficient during training, e.g. Flamingo and OpenFlamingo. (c) SAISA aligns visual features with the self-attention input spaces of the language model, achieving efficiency during both training and inference.
  • Figure 3: NAAViT self-attention block. NAAViT uses only text tokens as queries. The queries can attend to visual tokens and text tokens preceding them. Visual tokens are not updated in this block.
  • Figure 4: Inference computational cost comparison between SAISA and LLaVA-1.5 with different numbers of visual and text tokens, where t denotes the number of text tokens. SAISA achieves higher computational efficiency than LLaVA-1.5.
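
The token-count dependence shown in Figure 4 can be illustrated with a rough count of query-key pairs. This is a back-of-the-envelope sketch, not the paper's FLOPs model. It counts only attention score pairs and ignores FFN and projection costs, so it does not reproduce the reported 66% figure. It assumes visual tokens precede text tokens and that text tokens attend causally, including to themselves. The function names are hypothetical; 576 is the number of patch tokens from a 336-pixel image with 14-pixel patches (24 × 24), and the text length of 64 is an arbitrary example value.

```python
def causal_pairs(v, t):
    """Query-key pairs in ordinary causal self-attention over v + t tokens."""
    n = v + t
    return n * (n + 1) // 2


def naavit_pairs(v, t):
    """Query-key pairs when only the t text tokens act as queries.

    Each text token attends to all v visual tokens (t * v pairs) and,
    causally, to the text up to and including itself (t * (t + 1) / 2 pairs).
    """
    return t * v + t * (t + 1) // 2


if __name__ == "__main__":
    v, t = 576, 64  # t = 64 is an arbitrary example, not a value from the paper
    full = causal_pairs(v, t)
    reduced = naavit_pairs(v, t)
    print(full, reduced, round(reduced / full, 3))
```

The difference between the two counts is exactly v(v + 1)/2, the pairs in which a visual token would have been a query, which is why the saving grows quickly with the number of visual tokens.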