CROME: Cross-Modal Adapters for Efficient Multimodal LLM

Sayna Ebrahimi, Sercan O. Arik, Tejas Nama, Tomas Pfister

TL;DR

CROME tackles the high cost of adapting multimodal LLMs by introducing a lightweight, gated cross-modal adapter that fuses visual and textual features before they reach a frozen LLM. It relies on pre-LM alignment and three training stages (pretraining, instruction tuning, and optional task-specific fine-tuning), and needs only about 5M trainable parameters for adaptation. Empirically, CROME achieves state-of-the-art results on multiple open benchmarks, with strong zero-shot performance and substantial gains from task-specific fine-tuning (e.g., up to 93.2% on ScienceQA-Image after adaptation). The approach highlights the practicality and scalability of pre-LM alignment for flexible multimodal learning: it reduces cost, preserves the LLM's capabilities, and enables targeted downstream performance.

Abstract

Multimodal Large Language Models (MLLMs) demonstrate remarkable image-language capabilities, but their widespread use faces challenges in cost-effective training and adaptation. Existing approaches often necessitate expensive language model retraining and offer limited adaptability. Additionally, the current focus on zero-shot performance improvements offers insufficient guidance for task-specific tuning. We propose CROME, an efficient vision-language instruction tuning framework. It features a novel gated cross-modal adapter that effectively combines visual and textual representations prior to input into a frozen LLM. This lightweight adapter, trained with minimal parameters, enables efficient cross-modal understanding. Notably, CROME demonstrates superior zero-shot performance on standard visual question answering and instruction-following benchmarks. Moreover, it enables fine-tuning with exceptional parameter efficiency, competing with task-specific specialist state-of-the-art methods. CROME demonstrates the potential of pre-LM alignment for building scalable, adaptable, and parameter-efficient multimodal models.

Paper Structure

This paper contains 29 sections, 2 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: CROME achieves state-of-the-art results on 6 MLLM benchmarks (top right) with its unique pre-LM cross-modal adapter (middle right). Bottom Right: Training data and parameter comparisons. Left: A qualitative example on the unusual "ironing man" image.
  • Figure 2: Overview of our CROME model architecture with CROME-Adapter. CROME takes both image and text as input to generate output text autoregressively. The text input query is encoded by the Q-Former, which utilizes learnable queries to effectively represent instruction-aware visual features. These are then processed by a projection layer. The image input is encoded by a vision encoder and its patch embeddings are used in both the Q-Former's cross-attention layers and a projection layer. Within the cross-modal adapter, the projected image and text features undergo individual down-projection using a gated linear unit (see the section on the cross-modal adapter). They are then up-projected through a weight-sharing linear layer. Finally, the cross-modal adapter outputs for text and image are concatenated with the tokenized question and fed into the LLM to obtain the text output.
  • Figure 3: Overview of CROME training stages. Blue indicates frozen components, and red indicates trainable components. (a) Pretraining: CROME-Adapter and projection layers are trained on image-caption pairs. (b) Instruction-tuning: Q-Former, CROME-Adapter, and projection layers are trained on diverse image-instruction datasets. (c) Task-specific fine-tuning: CROME-Adapter facilitates efficient training on task-specific data.
  • Figure 4: Qualitative examples from various zero-shot MLLM benchmarks we have evaluated CROME on.
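
The adapter data flow in the Figure 2 caption and the training schedule in the Figure 3 caption can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation. Several details are assumptions: the sigmoid gate, the absence of biases and residual connections, the order in which the sequences are concatenated, the toy dimensions, and every function and dictionary name. Listing only the adapter as trainable in the task-specific stage is a reading of the Figure 3 caption.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def glu_down(x, w_value, w_gate):
    """Gated linear unit down-projection: (x @ w_value) * sigmoid(x @ w_gate).

    The sigmoid gate is an assumption; the captions only say "gated linear unit".
    """
    return (x @ w_value) * sigmoid(x @ w_gate)


def init_adapter(d_model, d_bottleneck, seed=0):
    """Create hypothetical adapter weights: one GLU per modality, one shared up-projection."""
    rng = np.random.default_rng(seed)

    def w(rows, cols):
        return rng.normal(0.0, 0.02, size=(rows, cols))

    return {
        "img_value": w(d_model, d_bottleneck),
        "img_gate": w(d_model, d_bottleneck),
        "txt_value": w(d_model, d_bottleneck),
        "txt_gate": w(d_model, d_bottleneck),
        "up_shared": w(d_bottleneck, d_model),  # shared by both modalities
    }


def cross_modal_adapter(params, img_feats, txt_feats):
    """Down-project each modality with its own GLU, then up-project with shared weights."""
    img_hidden = glu_down(img_feats, params["img_value"], params["img_gate"])
    txt_hidden = glu_down(txt_feats, params["txt_value"], params["txt_gate"])
    return img_hidden @ params["up_shared"], txt_hidden @ params["up_shared"]


def build_llm_input(img_out, txt_out, question_embeds):
    """Concatenate along the sequence axis; the ordering here is an assumption."""
    return np.concatenate([img_out, txt_out, question_embeds], axis=0)


def count_params(params):
    return sum(p.size for p in params.values())


# Trainable components per stage, following the Figure 3 caption.
# The vision encoder and the LLM are not listed as trainable in any stage.
TRAINABLE_BY_STAGE = {
    "pretraining": {"adapter", "projection"},
    "instruction_tuning": {"q_former", "adapter", "projection"},
    "task_finetuning": {"adapter"},  # assumption: adapter only
}

# Toy sizes, chosen only for illustration.
params = init_adapter(d_model=16, d_bottleneck=4)
img_feats = np.ones((5, 16))   # 5 projected image tokens
txt_feats = np.ones((3, 16))   # 3 projected Q-Former query tokens
question = np.ones((2, 16))    # 2 embedded question tokens

img_out, txt_out = cross_modal_adapter(params, img_feats, txt_feats)
llm_input = build_llm_input(img_out, txt_out, question)
print(llm_input.shape, count_params(params))
```

In this sketch the shared up-projection is stored once, so adding a second modality costs only one extra pair of down-projection matrices. That is one plausible reason such an adapter stays small, though the toy parameter count says nothing about the roughly 5M figure quoted in the TL;DR.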