CaMML: Context-Aware Multimodal Learner for Large Models

Yixin Chen; Shuai Zhang; Boran Han; Tong He; Bo Li

TL;DR

CaMML tackles the inflexibility of static LMMs by introducing a lightweight, context-aware module that ingests long multimodal context samples. It employs a hierarchical CaMML Perceiver pipeline (Vision Perceiver, Language Perceiver, and Context Perceiver) to fuse interleaved image-text contexts into a compact representation ($M$ tokens) that conditions an LLM. A frozen ImageBind-Faiss retriever fetches the top-$N$ samples from a fixed datastore, and training minimizes the loss $\ell = -\sum_{i=1}^{|y|} \log p_{\theta}(y_i \mid \hat{y}_{1:i-1}, q, C_{1}, \ldots, C_{N})$. CaMML-7B and CaMML-13B achieve state-of-the-art performance across more than ten multimodal benchmarks, including ScienceQA, without external data integration, and exhibit strong context-awareness, efficient long-context processing, and reduced multimodal hallucination. These results highlight CaMML's practical impact for scalable multimodal reasoning in real-world tasks where up-to-date, domain-specific context is essential.
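The training loss in the TL;DR is a summed negative log-likelihood over the response tokens. The following is a minimal sketch of that sum, assuming the per-position logits have already been produced by a model conditioned on the query $q$ and the contexts $C_1, \ldots, C_N$. The function name `sequence_nll` and the array shapes are illustrative assumptions, not taken from the paper or its code.

```python
import numpy as np

def sequence_nll(logits: np.ndarray, target_ids: np.ndarray) -> float:
    """Sum of -log p(y_i | prefix, q, C_1..C_N) over the response tokens.

    logits:     (T, V) unnormalized scores, one row per response position,
                assumed to be already conditioned on the query and contexts.
    target_ids: (T,) integer ids of the reference tokens y_1..y_T.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Select the log-probability assigned to each reference token.
    token_log_probs = log_probs[np.arange(len(target_ids)), target_ids]
    return float(-token_log_probs.sum())

# Uniform logits over a 4-token vocabulary give a loss of T * log(4).
uniform_loss = sequence_nll(np.zeros((3, 4)), np.array([0, 1, 2]))
```

With uniform logits every token has probability 1/4, so the loss reduces to $|y| \log 4$, which is a convenient sanity check for an implementation.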

Abstract

In this work, we introduce the Context-Aware MultiModal Learner (CaMML) for tuning large multimodal models (LMMs). CaMML is a lightweight module crafted to seamlessly integrate multimodal contextual samples into large models, thereby empowering the model to derive knowledge from analogous, domain-specific, up-to-date information and make grounded inferences. Importantly, CaMML is highly scalable and can efficiently handle lengthy multimodal context examples owing to its hierarchical design. Based on CaMML, we have developed two multimodal models, CaMML-7B and CaMML-13B, that have shown exceptional performance across an array of benchmark datasets for multimodal tasks. Remarkably, CaMML-13B achieves state-of-the-art performance on over ten widely recognized multimodal benchmark datasets, surpassing LLaVA-1.5 (13B) by a noticeable margin, without integrating any external resources. Moreover, we have conducted extensive ablation studies to inspect the inner workings of CaMML and performed qualitative analyses to showcase its effectiveness in handling challenging real-world cases. Code and models are available at: https://github.com/amazon-science/camml.
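To illustrate why a perceiver-style design can handle lengthy context cheaply, here is a rough sketch of the general idea: a fixed set of $M$ latent queries cross-attends to a long token sequence and returns exactly $M$ output tokens, whatever the input length. This is a generic, single-head simplification without learned projections. It is an assumption made for illustration and not the CaMML Perceiver itself. The name `cross_attention_compress` and all sizes are hypothetical.

```python
import numpy as np

def cross_attention_compress(context_tokens: np.ndarray,
                             latent_queries: np.ndarray) -> np.ndarray:
    """Compress (L, d) context tokens into (M, d) using M latent queries."""
    d = context_tokens.shape[-1]
    # Scaled dot-product scores between each latent query and each token.
    scores = latent_queries @ context_tokens.T / np.sqrt(d)   # (M, L)
    # Softmax over the context axis, stabilized by subtracting the row max.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted average of the context tokens.
    return weights @ context_tokens                            # (M, d)

rng = np.random.default_rng(0)
queries = rng.normal(size=(32, 64))          # M = 32 latent queries (assumed)
long_context = rng.normal(size=(2048, 64))   # L = 2048 context tokens (assumed)
compressed = cross_attention_compress(long_context, queries)
print(compressed.shape)  # (32, 64)
```

The output size depends only on the number of latent queries, so the downstream LLM sees $M$ tokens regardless of how many context tokens were retrieved.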

Paper Structure (41 sections, 5 equations, 10 figures, 6 tables)

Introduction
Related Work
Large Multimodal Models
Multimodal Few-shot Learning
Context-Aware Multimodal Learner
Architecture of CaMML
Datastore and Context Retriever
Multimodal CaMML Perceiver
Vision Perceiver
Language Perceiver
Context Perceiver
Model Training
Experiment
Multimodal Reasoning on ScienceQA
Multimodal Instruction Tuning
...and 26 more sections

Figures (10)

Figure 1: CaMML achieves the state-of-the-art performance on a number of multimodal benchmarks, outperforming LLaVA-1.5 and many other large multimodal models.
Figure 2: The CaMML framework, which consists of a retriever, a perceiver, and a generator. Upon receiving a user query $q$, the CaMML retriever identifies relevant multimodal contexts $C$ from the datastore. The CaMML Perceiver then seamlessly integrates the various modalities, effectively encoding long-context information and injecting it into the CaMML generator. This allows for the prediction of responses that are conditioned on both the context and the query $q$.
Figure 3: Ablation experiments on CaMML Perceiver hyper-parameters: number of layers, query number $M$, and hidden size. CaMML-7B variants with different settings are evaluated on the ScienceQA test set.
Figure 4: Ablation experiments on the CaMML context number $N$. Left: CaMML models trained with different numbers of shots $N$ are evaluated under 1$\sim$32 shots. Right: comparison between CaMML and CaMML without the perceiver in terms of inference running time and memory footprint. Statistics are averaged over 100 ScienceQA samples run with CaMML-7B on an NVIDIA A100-80G GPU.
Figure 5: Visualization of context-aware CaMML vs. no-context LLaVA-1.5. Left: sketch drawing of the Great Wall. Right: depiction of metamorphosis of a butterfly.
...and 5 more figures
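The retrieval step described in the Figure 2 caption (finding the top-$N$ contexts for a query $q$) can be sketched as a nearest-neighbor search over embeddings. The sketch below assumes that embeddings have already been computed by some encoder and uses brute-force cosine similarity in NumPy as a stand-in for an ImageBind encoder plus a Faiss index. The function name `retrieve_top_n` and the toy vectors are illustrative assumptions.

```python
import numpy as np

def retrieve_top_n(query_emb: np.ndarray, datastore_embs: np.ndarray, n: int):
    """Return indices and scores of the n datastore entries closest to the query."""
    # L2-normalize so that the inner product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    keys = datastore_embs / np.linalg.norm(datastore_embs, axis=1, keepdims=True)
    sims = keys @ q
    # Sort by descending similarity and keep the first n entries.
    top = np.argsort(-sims)[:n]
    return top, sims[top]

# Toy datastore of four 2-D embeddings (hypothetical values).
datastore = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
query = np.array([1.0, 0.05])
idx, scores = retrieve_top_n(query, datastore, n=2)
print(idx.tolist())  # [0, 2]
```

Because the retriever is described as frozen and the datastore as fixed, the datastore embeddings could in principle be computed once and reused across queries. An exhaustive search like this one is only practical for small datastores.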
