CaMML: Context-Aware Multimodal Learner for Large Models
Yixin Chen, Shuai Zhang, Boran Han, Tong He, Bo Li
TL;DR
CaMML tackles the inflexibility of static LMMs by introducing a lightweight, context-aware module that ingests long multimodal context samples. It employs a hierarchical CaMML Perceiver pipeline (Vision Perceiver, Language Perceiver, and Context Perceiver) to fuse interleaved image-text contexts into a compact representation of $M$ tokens that conditions an LLM. A frozen ImageBind-Faiss retriever fetches the top-$N$ samples from a fixed datastore, and training minimizes the loss $\ell = -\sum_{i=1}^{|y|} \log p_{\theta}(y_i \mid \hat{y}_{1:i-1}, q, C_{1}, \ldots, C_{N})$. CaMML-7B and CaMML-13B achieve state-of-the-art performance across more than ten multimodal benchmarks, including ScienceQA, without external data integration, and exhibit strong context-awareness, efficient long-context processing, and reduced multimodal hallucination. These results highlight CaMML's practical impact for scalable multimodal reasoning in real-world tasks where up-to-date, domain-specific context is essential.
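The retrieval step and the training loss described above can be illustrated with a minimal sketch. It is not the authors' implementation: brute-force cosine similarity over toy vectors stands in for the ImageBind embeddings and the Faiss index, and the function names `top_n` and `nll_loss` are hypothetical.

```python
import math


def top_n(query, datastore, n):
    """Return indices of the n datastore embeddings most similar to `query`.

    Assumption: cosine similarity over plain Python lists replaces the
    ImageBind embeddings and Faiss search used in the paper.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    scores = [cosine(query, emb) for emb in datastore]
    ranked = sorted(range(len(datastore)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


def nll_loss(token_probs):
    """Negative log-likelihood: -sum_i log p(y_i | y_{1:i-1}, q, C_1..C_N).

    `token_probs` holds the probability the model assigned to each target
    token, already conditioned on the query and the retrieved contexts.
    """
    return -sum(math.log(p) for p in token_probs)


# Toy datastore of three 2-D embeddings and one query embedding.
datastore = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.1]
context_ids = top_n(query, datastore, n=2)

# Probabilities a hypothetical model assigned to a two-token answer.
loss = nll_loss([0.5, 0.25])
```

Here `context_ids` selects the $N = 2$ nearest samples that would be passed to the perceiver as $C_1, \ldots, C_N$, and `loss` is the sum of per-token negative log-probabilities for the answer $y$.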
Abstract
In this work, we introduce the Context-Aware MultiModal Learner (CaMML) for tuning large multimodal models (LMMs). CaMML, a lightweight module, is crafted to seamlessly integrate multimodal contextual samples into large models, thereby empowering the model to derive knowledge from analogous, domain-specific, up-to-date information and make grounded inferences. Importantly, CaMML is highly scalable and can efficiently handle lengthy multimodal context examples owing to its hierarchical design. Based on CaMML, we have developed two multimodal models, CaMML-7B and CaMML-13B, that have shown exceptional performance across an array of benchmark datasets for multimodal tasks. Remarkably, CaMML-13B achieves state-of-the-art performance on over ten widely recognized multimodal benchmark datasets, surpassing LLaVA-1.5 (13B) by a noticeable margin, without integrating any external resources. Moreover, we have conducted extensive ablation studies to inspect the inner workings of CaMML and performed qualitative analyses to showcase its effectiveness in handling challenging real-world cases. Code and models are available at: https://github.com/amazon-science/camml.
