CaMML: Context-Aware Multimodal Learner for Large Models

Yixin Chen, Shuai Zhang, Boran Han, Tong He, Bo Li

TL;DR

CaMML tackles the inflexibility of static LMMs by introducing a lightweight, context-aware module that ingests long multimodal context samples. It employs a hierarchical CaMML Perceiver pipeline (Vision Perceiver, Language Perceiver, and Context Perceiver) to fuse interleaved image-text contexts into a compact representation of $M$ tokens that conditions an LLM. A frozen ImageBind-Faiss retriever fetches the top-$N$ samples from a fixed datastore, and training minimizes the loss $\ell = -\sum_{i=1}^{|y|} \log p_{\theta}(y_i \mid \hat{y}_{1:i-1}, q, C_{1}, \ldots, C_{N})$. CaMML-7B and CaMML-13B achieve state-of-the-art performance across more than ten multimodal benchmarks, including ScienceQA, without external data integration, and exhibit strong context-awareness, efficient long-context processing, and reduced multimodal hallucination. These results highlight CaMML's practical impact for scalable multimodal reasoning in real-world tasks where up-to-date, domain-specific context is essential.
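
The retrieval step and the training objective in the summary above can be illustrated with a minimal sketch in plain Python. This is not the authors' code: CaMML uses ImageBind embeddings with a Faiss index, whereas the sketch substitutes brute-force cosine similarity over toy two-dimensional vectors. The function names `top_n` and `nll_loss`, the datastore keys, and all numbers are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_n(query_vec, datastore, n):
    """Return the keys of the n datastore entries most similar to the query.

    Brute-force stand-in for an index search (assumption, not the paper's retriever).
    """
    ranked = sorted(datastore, key=lambda k: cosine(query_vec, datastore[k]), reverse=True)
    return ranked[:n]

def nll_loss(token_probs):
    """Sum of negative log-probabilities assigned to the target tokens y_1..y_|y|."""
    return -sum(math.log(p) for p in token_probs)

# Toy datastore of context embeddings (hypothetical values).
datastore = {
    "cat_photo": [1.0, 0.0],
    "dog_photo": [0.8, 0.6],
    "car_photo": [0.0, 1.0],
}

# Retrieve the top-N contexts for a query embedding, here with N = 2.
contexts = top_n([1.0, 0.1], datastore, n=2)

# Probabilities a model might assign to each target token given q and the contexts.
loss = nll_loss([0.5, 0.25])
```

In this toy setting `contexts` holds the two nearest entries and `loss` equals $-(\log 0.5 + \log 0.25)$. In the actual system the probabilities would come from the LLM conditioned on the query and the perceiver output.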

Abstract

In this work, we introduce Context-Aware MultiModal Learner (CaMML), for tuning large multimodal models (LMMs). CaMML, a lightweight module, is crafted to seamlessly integrate multimodal contextual samples into large models, thereby empowering the model to derive knowledge from analogous, domain-specific, up-to-date information and make grounded inferences. Importantly, CaMML is highly scalable and can efficiently handle lengthy multimodal context examples owing to its hierarchical design. Based on CaMML, we have developed two multimodal models, CaMML-7B and CaMML-13B, that have shown exceptional performance across an array of benchmark datasets for multimodal tasks. Remarkably, CaMML-13B achieves the state-of-the-art performance on over ten widely recognized multimodal benchmark datasets, surpassing LLaVA-1.5 (13B) with a noticeable margin, without integration of any external resources. Moreover, we have conducted extensive ablative studies to inspect the inner workings of CaMML and performed qualitative analyses to showcase its effectiveness in handling real-world challenging cases. Code and models are available at: https://github.com/amazon-science/camml.

Paper Structure (41 sections, 5 equations, 10 figures, 6 tables)

This paper contains 41 sections, 5 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: CaMML achieves the state-of-the-art performance on a number of multimodal benchmarks, outperforming LLaVA-1.5 and many other large multimodal models.
  • Figure 2: The CaMML framework, which consists of a retriever, a perceiver, and a generator. On receiving a user query $q$, the CaMML retriever identifies relevant multimodal contexts $C$ from the datastore. The CaMML Perceiver then integrates the various modalities, encoding the long-context information and injecting it into the CaMML generator. This allows the model to predict responses conditioned on both the context and the query $q$.
  • Figure 3: Ablation experiments on CaMML perceiver hyper-parameters: number of layers, query number $M$, and hidden size. CaMML-7B models with different settings are evaluated on the ScienceQA test set.
  • Figure 4: Ablation experiments on the CaMML context number $N$. Left: CaMML models trained with $N$ shots are evaluated under 1$\sim$32 shots. Right: comparison between CaMML and CaMML without the perceiver in terms of inference running time and memory footprint. The statistics are averaged over 100 ScienceQA samples run with CaMML-7B on an NVIDIA A100-80G GPU.
  • Figure 5: Visualization of context-aware CaMML vs. no-context LLaVA-1.5. Left: sketch drawing of the Great Wall. Right: depiction of metamorphosis of a butterfly.
  • ...and 5 more figures
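
The captions for Figures 2 and 4 describe the perceiver compressing long contexts before they reach the generator. A rough sketch of the token arithmetic follows, assuming that without a perceiver the raw context tokens are simply concatenated with the query. The function name `llm_input_tokens` and all token counts are hypothetical and are not taken from the paper.

```python
def llm_input_tokens(n_contexts, tokens_per_context, query_tokens, m_perceiver_tokens=None):
    """Count the tokens the generator receives for one query.

    Without a perceiver (m_perceiver_tokens is None), every context token is
    passed through, so the input grows linearly with the number of contexts.
    With a perceiver, the contexts are summarized into a fixed M tokens.
    """
    if m_perceiver_tokens is None:
        return n_contexts * tokens_per_context + query_tokens
    return m_perceiver_tokens + query_tokens

# Hypothetical sizes: 3 retrieved contexts of 600 tokens each, a 40-token query, M = 128.
raw = llm_input_tokens(3, 600, 40)
compressed = llm_input_tokens(3, 600, 40, m_perceiver_tokens=128)

# With a perceiver the generator input does not depend on the number of contexts.
compressed_32_shots = llm_input_tokens(32, 600, 40, m_perceiver_tokens=128)
```

Under these assumed sizes the generator input stays fixed as $N$ grows, which is one plausible reading of why the right panel of Figure 4 compares running time and memory with and without the perceiver.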