FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression

Bo Tong, Bokai Lai, Yiyi Zhou, Gen Luo, Yunhang Shen, Ke Li, Xiaoshuai Sun, Rongrong Ji

TL;DR

The paper tackles the latency bottleneck of multimodal LLMs caused by large visual token budgets. It introduces FlashSloth, a tiny MLLM that embeds two complementary visual-compression modules—Spatial Attention Pooling (SAP) for saliency-driven token reduction and Embedded Query (EmbQ) for instruction-aware visual grounding—into a unified architecture, avoiding external pretraining. Through a two-stage training regime and a high-resolution variant (FlashSloth-HD), the approach achieves substantial efficiency gains (token counts reduced by $80$-$89\%$, memory by $61$-$80\%$, FLOPs by $70$-$98\%$, and latency improved by $2$-$5\times$) while maintaining competitive performance on 14 VL benchmarks and seven multimodal tasks. The results suggest that embedding visual compression within the MLLM pipeline can yield practical, deployable improvements for real-world vision-language reasoning without the overhead of large-scale VL alignment pretraining.
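To make the idea of saliency-driven token reduction concrete, here is a minimal NumPy sketch of attention pooling over non-overlapping spatial windows. It is an illustration under stated assumptions, not the paper's SAP implementation: the function name `spatial_attention_pool`, the linear scorer `w_score`, and the 24x24 grid with 2x2 windows are all hypothetical choices made for this example.

```python
import numpy as np

def spatial_attention_pool(tokens, grid, window, w_score):
    """Collapse each window x window patch of visual tokens into one token.

    tokens:  (grid * grid, dim) array of visual tokens in row-major order.
    w_score: (dim,) vector of a hypothetical linear saliency scorer.
    Returns an array of shape ((grid // window) ** 2, dim).
    """
    n, dim = tokens.shape
    assert n == grid * grid and grid % window == 0
    g = grid // window
    # Rearrange the flat token sequence into non-overlapping windows.
    x = tokens.reshape(g, window, g, window, dim)
    x = x.transpose(0, 2, 1, 3, 4).reshape(g * g, window * window, dim)
    # One scalar saliency score per token, then a softmax inside each window.
    scores = x @ w_score
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    # The attention-weighted sum keeps one token per window.
    return (weights[..., None] * x).sum(axis=1)

# Assumed sizes for illustration only: a 24x24 token grid, 2x2 windows.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 8))
pooled = spatial_attention_pool(tokens, grid=24, window=2, w_score=rng.normal(size=8))
print(pooled.shape)
```

With these assumed sizes, 576 tokens become 144. When `w_score` is all zeros the softmax weights are uniform and the operation reduces to plain average pooling, so a learned scorer is what lets salient tokens dominate their window.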

Abstract

Despite a big leap forward in capability, multimodal large language models (MLLMs) tend to behave like a sloth in practical use, i.e., slow response and large latency. Recent efforts are devoted to building tiny MLLMs for better efficiency, but the plethora of visual tokens they still use limits their actual speedup. In this paper, we propose a powerful and fast tiny MLLM called FlashSloth. Different from previous efforts, FlashSloth focuses on improving the descriptive power of visual tokens in the process of compressing their redundant semantics. In particular, FlashSloth introduces embedded visual compression designs to capture both visually salient and instruction-related image information, so as to achieve superior multimodal performance with fewer visual tokens. Extensive experiments are conducted to validate the proposed FlashSloth, and a number of tiny but strong MLLMs are also comprehensively compared, e.g., InternVL2, MiniCPM-V2 and Qwen2-VL. The experimental results show that, compared with these advanced tiny MLLMs, FlashSloth can greatly reduce the number of visual tokens, training memory and computational complexity while retaining high performance on various VL tasks.

Paper Structure

This paper contains 25 sections, 6 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Comparison between FlashSloth and recent MLLMs on MMB in terms of performance, response time (the prediction of the first token) and GPU memory overhead (the circle size). Advanced tiny MLLMs [minigemini, chu2024mobilevlm, shao2024imp, minicpm, wang2024qwen2, internvl] can already exhibit strong capability against common MLLMs like LLaVA-1.5-7B [llava1.5], but their actual speedup is greatly limited by the excessive use of visual tokens. Our FlashSloth is a powerful and tiny MLLM that offers a decent balance between performance and efficiency.
  • Figure 2: The overall framework of the proposed FlashSloth. The visual tokens extracted by the vision encoder are first refined and compressed by a Spatial Attention Pooling (SAP) module and then fed to FlashSloth. In addition to visual and text tokens, a set of learnable query tokens is also appended to the input, and these tokens query instruction-related image information via the Embedded Query Module (EmbQ) after some layers of FlashSloth. In particular, SAP captures the visually salient semantics in image regions via uni-modal visual attention, as depicted on the left. EmbQ is a lightweight, embedded module for visual enhancement in FlashSloth that requires no additional language modeling or dedicated alignment pretraining, as shown on the right.
  • Figure 3: Comparison between FlashSloth and three MLLMs in terms of training efficiency. The results are obtained using LLaVA-665k [llava1.5] for fair comparisons. FlashSloth is superior in both GPU memory overhead and training time costs.
  • Figure 4: Visualized results of FlashSloth with Qwen2-VL-2B and InternVL2-2.2B. Subfigure (a) shows the attention maps of the different visual compression modules in FlashSloth, illustrating the ability of SAP to compress visual saliency and of EmbQ to query instruction-related visual information. Subfigure (b) shows FlashSloth's rapid response time and its performance on common tasks, which is comparable to or better than that of the SOTA tiny MLLMs. The clock time represents the response time of the model. Incorrect answers are in red.
  • Figure 5: FlashSloth's performance on ticket OCR and mathematical question answering.
  • ...and 4 more figures