Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling

Georgios Pantazopoulos, Malvina Nikandrou, Alessandro Suglia, Oliver Lemon, Arash Eshghi

TL;DR

This study explores replacing Transformers in Visual Language Models with Mamba, a recent structured state space model (SSM) that demonstrates promising performance in sequence modeling. The results indicate that task-aware encoding yields minimal performance gains on grounding, but that Transformers significantly outperform Mamba at in-context multimodal retrieval.

Abstract

This study explores replacing Transformers in Visual Language Models (VLMs) with Mamba, a recent structured state space model (SSM) that demonstrates promising performance in sequence modeling. We test models up to 3B parameters under controlled conditions, showing that Mamba-based VLMs outperform Transformer-based VLMs in captioning, question answering, and reading comprehension. However, we find that Transformers achieve greater performance in visual grounding, and the performance gap widens with scale. We explore two hypotheses to explain this phenomenon: 1) the effect of task-agnostic visual encoding on the updates of the hidden states, and 2) the difficulty in performing visual grounding from the perspective of in-context multimodal retrieval. Our results indicate that task-aware encoding yields minimal performance gains on grounding, but that Transformers significantly outperform Mamba at in-context multimodal retrieval. Overall, Mamba shows promising performance on tasks where the correct output relies on a summary of the image but struggles when retrieval of explicit information from the context is required.

Paper Structure (49 sections, 3 equations, 18 figures, 11 tables)

Figures (18)

  • Figure 1: Overview of Mamba-VL. We embed images using EVA-02 and use an MLP as the V&L connector to align the image embeddings with the text embeddings before the Mamba backbone. Because Mamba does not encode positional information, we introduce custom tokens that delineate the beginning and the end position of the image in the sequence. We also use custom tokens that act as row separators within the image. The vision encoder is kept frozen during training.
  • Figure 2: Overview of task categorization and format. We leverage a collection of datasets for coarse-grained (e.g., image captioning, visual question answering) and fine-grained (e.g., visual grounding, reading comprehension) multimodal tasks. Text in purple indicates the outputs of a model for each task. $^*$ denotes held-out datasets.
  • Figure 3: Results of finetuned 1.4B models with increased resolution on VQAv2 (top), RefCOCOg (middle), and TextVQA (bottom). Increasing the resolution to 480$\times$480 pixels results in better performance for both models; however, Pythia benefits significantly more than Mamba in the grounding task.
  • Figure 4: Relative performance difference on visual grounding benchmarks between task-aware and task-agnostic visual encoding. On average, task-aware encoding yields a marginal performance boost on Mamba-VL while it has almost no effect on Pythia-VL.
  • Figure 5: Overview of the synthetic visual grounding task. The model accepts as input a sequence of unique special tokens, followed by an output token and a special token id that appears in the context as a query. The model needs to predict the token id corresponding to the position of the queried token.
  • ...and 13 more figures
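
The synthetic visual grounding task described in the Figure 5 caption can be illustrated with a small data generator. The following is a minimal sketch, not the authors' implementation: the function name `make_example`, the vocabulary layout (special tokens first, then one output marker, then position ids), and the default sizes are all assumptions made for illustration. The ordering of the output token and the query follows the caption's wording.

```python
import random


def make_example(num_special=32, context_len=8, seed=None):
    """Build one synthetic retrieval example (hypothetical layout).

    Assumed id layout, chosen only for this sketch:
      [0, num_special)            -> unique special tokens
      num_special                 -> output marker token
      num_special + 1 + position  -> position ids used as targets
    """
    rng = random.Random(seed)
    output_token = num_special
    position_base = num_special + 1

    # Context: special tokens sampled without replacement, so each is unique.
    context = rng.sample(range(num_special), context_len)

    # Query: one of the tokens that already appears in the context.
    query_index = rng.randrange(context_len)
    query = context[query_index]

    # Input: context, then the output marker, then the queried token.
    inputs = context + [output_token, query]

    # Target: the id that encodes the position of the queried token.
    target = position_base + query_index
    return inputs, target


if __name__ == "__main__":
    inputs, target = make_example(seed=0)
    print("inputs:", inputs)
    print("target:", target)
```

Because every context token is unique, the target is determined solely by where the queried token occurs, so solving the task requires retrieving an explicit position from the context rather than a summary of it.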