Selective State Space Memory for Large Vision-Language Models

Chee Ng, Yuen Fung

TL;DR

This work tackles the costly fine-tuning of large vision-language models (LVLMs) by introducing State Space Memory Integration (SSMI), which embeds lightweight Mamba-based state space modules into LVLMs to capture long-range dependencies with minimal parameter updates. The approach uses a two-stage training pipeline: the Mamba modules are first pretrained on general vision-language tasks and then fine-tuned on the target task. Most LVLM parameters stay frozen throughout, so only around $0.5\%$ of the parameters are trainable. Empirical results on COCO Captioning, VQA, and Flickr30k demonstrate state-of-the-art performance with high efficiency, robustness to noise, and strong zero-shot generalization, validated by quantitative metrics and human evaluation. These findings position SSMI as a scalable, interpretable, and practical fine-tuning framework for large multimodal models, enabling broader applicability in domain-specific settings.
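To make the trainable-fraction figure concrete, the sketch below counts parameters for a frozen backbone plus a small adapter. The component names and parameter counts are hypothetical, chosen only so the arithmetic lands near $0.5\%$; they are not taken from the paper.

```python
def trainable_fraction(param_counts, trainable_prefixes):
    """Return the share of parameters whose component name starts with a trainable prefix."""
    total = sum(param_counts.values())
    trainable = sum(
        count
        for name, count in param_counts.items()
        if name.startswith(tuple(trainable_prefixes))
    )
    return trainable / total


# Hypothetical parameter counts (assumptions for illustration, not from the paper).
param_counts = {
    "vision_encoder": 300_000_000,    # frozen
    "language_model": 6_665_000_000,  # frozen
    "ssm_adapter": 35_000_000,        # the only trainable component
}

frac = trainable_fraction(param_counts, ["ssm_adapter"])
print(f"trainable fraction: {frac:.2%}")
```

With these assumed sizes, 35 million trainable parameters out of 7 billion in total gives a fraction of 0.005.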

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across a wide range of multimodal tasks. However, fine-tuning these models for domain-specific applications remains a computationally intensive challenge. This paper introduces State Space Memory Integration (SSMI), a novel approach for efficient fine-tuning of LVLMs. By integrating lightweight Mamba-based state space modules into the LVLM architecture, SSMI captures long-range dependencies and injects task-specific visual and sequential patterns effectively. Unlike traditional fine-tuning methods, SSMI requires only a fraction of the model's parameters to be updated, making it computationally efficient and scalable. Experiments on benchmark datasets, including COCO Captioning, VQA, and Flickr30k, demonstrate that SSMI achieves state-of-the-art performance while maintaining robustness and generalization capabilities. Comprehensive analysis further validates the advantages of SSMI in terms of efficiency, adaptability, and interpretability, positioning it as a compelling solution for fine-tuning large-scale vision-language models.
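The abstract does not give the equations of the Mamba-based modules, so the following is only a toy sketch of the general selective state space recurrence, reduced to one channel with a scalar state. The function name `selective_scan`, the weights `w_delta`, `w_b`, `w_c`, and the simplified discretization are all assumptions for illustration, not the authors' implementation.

```python
import math


def softplus(z):
    """Smooth positive function, used here to keep the step size above zero."""
    return math.log1p(math.exp(z))


def selective_scan(xs, a=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Toy one-channel selective state space recurrence (hypothetical weights).

    h_t = exp(delta_t * a) * h_{t-1} + delta_t * B_t * x_t
    y_t = C_t * h_t
    where delta_t, B_t and C_t all depend on the current input x_t.
    """
    h = 0.0
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)    # input-dependent step size, always > 0
        a_bar = math.exp(delta * a)      # decay factor in (0, 1) because a < 0
        b_bar = delta * (w_b * x)        # simplified (Euler) discretization of B_t
        h = a_bar * h + b_bar * x        # state carries information across steps
        ys.append((w_c * x) * h)         # input-dependent readout C_t
    return ys


ys = selective_scan([1.0, 0.5, 0.0, 2.0])
print(ys)
```

Because the step size depends on the input, each token decides how strongly the previous state is retained or overwritten, which is the mechanism that lets such a module carry information over long sequences at linear cost.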

Paper Structure

This paper contains 23 sections, 8 equations, and 6 tables.