
Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

Quoc-Huy Trinh, Mustapha Abdullahi, Bo Zhao, Debesh Jha

Abstract

Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as personal assistants, document understanding, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, which degrades performance on fine-grained reasoning tasks and limits their real-world effectiveness. To address these issues, we introduce Firebolt-VL, an efficient vision-language model that replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and uses them to modulate the state-space decoder via FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Firebolt-VL achieves accurate, fine-grained understanding with significantly improved efficiency. Our model and code are available at: https://fireboltvl.github.io
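
To make the modulation mechanism concrete, the sketch below shows one way a token-grid correlation followed by FiLM conditioning could be wired up in PyTorch. This is a minimal illustration under our own assumptions: the module name, the learned projections, and the relevance-pooling step are illustrative and are not taken from the released implementation.

```python
# Minimal sketch of token-grid correlation + FiLM conditioning.
# Module name, projections, and pooling are assumptions for
# illustration, not the paper's exact implementation.
import torch
import torch.nn as nn


class TokenGridCorrelation(nn.Module):
    """Correlates text tokens with image patches, then emits FiLM
    (scale, shift) parameters to condition the decoder's token stream."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)    # text -> query space
        self.k_proj = nn.Linear(d_model, d_model)    # patches -> key space
        self.film = nn.Linear(d_model, 2 * d_model)  # -> (gamma, beta)

    def forward(self, text: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        # text:    (B, T, d) text-token embeddings
        # patches: (B, P, d) image-patch embeddings
        q = self.q_proj(text)                        # (B, T, d)
        k = self.k_proj(patches)                     # (B, P, d)
        # Lightweight correlation map between tokens and patches.
        corr = torch.softmax(
            q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1
        )                                            # (B, T, P)
        # Pool patch features by their relevance to each token.
        pooled = corr @ patches                      # (B, T, d)
        gamma, beta = self.film(pooled).chunk(2, dim=-1)
        # FiLM conditioning: scale and shift each text token so that
        # prompt-relevant visual regions are emphasized downstream.
        return (1 + gamma) * text + beta             # (B, T, d)


# Usage: 16 text tokens attending over a 14x14 patch grid.
out = TokenGridCorrelation(512)(torch.randn(2, 16, 512), torch.randn(2, 196, 512))
```

In this reading, the softmax over patches gives each text token a distribution over image regions, so the FiLM scale and shift are driven by the visual evidence most correlated with that token; the extra cost scales with the product of text length and patch count rather than with the square of the full multimodal sequence.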


Figures (4)

  • Figure 1: Performance and latency efficiency of Firebolt-VL on multiple tasks compared with state-of-the-art baselines. Firebolt-VL is both competitive and efficient, demonstrating strong generalization across diverse tasks.
  • Figure 2: Overview of the Firebolt-VL architecture. The Cross-Modal Modulator (CMM) fuses textual instructions with the visual representations of the query image to produce conditioned tokens, which are then processed by the Liquid Foundation Model (LFM). The model is trained in two stages: (1) CMM pre-training to initialize modulation parameters, and (2) end-to-end training of the full framework.
  • Figure 3: Qualitative comparison of responses from Firebolt-VL with recent efficient vision-language models, including MobileVLM V2 [mobilevlmv2] and SmolVLM2 [smolvlm], on detail-dependent question-answering tasks. Firebolt-VL demonstrates stronger fine-grained grounding and more accurate, instruction-aligned responses.
  • Figure 4: Accuracy-latency comparison of compact MLLMs. Bubble area denotes parameter count in billions, and annotations indicate the exact model size. Firebolt-VL provides a favorable accuracy-latency trade-off, achieving a higher MME^p perception score at lower inference latency.