FewMMBench: A Benchmark for Multimodal Few-Shot Learning

Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem

TL;DR

The findings reveal that instruction-tuned models exhibit strong zero-shot performance but benefit minimally, or even regress, with additional demonstrations or CoT reasoning, and highlight FewMMBench as a rigorous testbed for diagnosing and advancing few-shot capabilities in multimodal LLMs.

Abstract

As multimodal large language models (MLLMs) advance in handling interleaved image-text data, assessing their few-shot learning capabilities remains an open challenge. In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting. Covering a diverse suite of multimodal understanding tasks, from attribute recognition to temporal reasoning, FewMMBench enables systematic analysis across task types, model families, and prompting strategies. We evaluate 26 open-weight MLLMs from six model families across zero-shot, few-shot, and CoT-augmented few-shot settings. Our findings reveal that instruction-tuned models exhibit strong zero-shot performance but benefit minimally, or even regress, with additional demonstrations or CoT reasoning. Retrieval-based demonstrations and increased context size also yield limited gains. These results highlight FewMMBench as a rigorous testbed for diagnosing and advancing few-shot capabilities in multimodal LLMs. The data is available at: https://huggingface.co/datasets/mustafaa/FewMMBench

Paper Structure (59 sections, 1 equation, 38 figures, 5 tables)

Figures (38)

  • Figure 1: Performance of selected MLLMs on FewMMBench across different evaluation settings. We compare instruction-tuned and non-instruction-tuned models under zero-shot, few-shot (random and similarity-based), and CoT-augmented few-shot configurations. Results show that few-shot prompting does not consistently improve performance for instruction-tuned models, even when demonstrations are semantically similar to the query or when the number of examples increases. Notably, CoT prompting often leads to a performance drop, suggesting modality-specific limitations in current CoT strategies.
  • Figure 2: Dataset Curation Pipeline for FewMMBench. (a) Task instances are collected and organized based on linguistically meaningful phenomena. (b) We extract visual and textual features for each instance and construct query sets using a Graph Cut-based submodular selection strategy, ensuring both diversity and representativeness. (c) CoT rationales are generated using the Qwen2.5-VL-7B-Instruct model. If the initial prediction is incorrect, the correct answer is injected and a new rationale is generated. An automated filter retains only high-quality examples.
  • Figure 3: Task distributions and representative examples in FewMMBench. The pie chart illustrates the distribution of samples across the nine tasks in the benchmark. Each surrounding example depicts a sample question corresponding to a specific task. These examples highlight the diversity of visual-linguistic reasoning skills covered by FewMMBench, spanning low-level perception, numerical reasoning, and high-level cognitive inference.
  • Figure 4: Accuracy of the top-performing MLLM from each model family on FewMMBench, evaluated with 0, 4, and 8 shots across three settings: Random, Similar, and Similar with CoT.
  • Figure 5: Pairwise accuracy of the top-performing model from three model families on FewMMBench, evaluated with 0, 4, and 8 shots across two settings: Random and Similar.
  • ...and 33 more figures
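
The Graph Cut-based submodular selection named in the Figure 2 caption can be illustrated with a minimal sketch. It assumes the commonly used Graph Cut objective f(S) = sum over i in V, j in S of sim[i][j], minus lam times the sum over i, j in S of sim[i][j], maximized greedily. The similarity matrix, the trade-off weight `lam`, and the function names are hypothetical; the page does not state the paper's exact objective, features, or hyperparameters.

```python
def graph_cut_value(sim, selected, lam):
    """Assumed Graph Cut objective: coverage of the full pool minus
    lam times the redundancy inside the selected set."""
    n = len(sim)
    coverage = sum(sim[i][j] for i in range(n) for j in selected)
    redundancy = sum(sim[i][j] for i in selected for j in selected)
    return coverage - lam * redundancy


def greedy_select(sim, budget, lam=1.0):
    """Greedy maximization: repeatedly add the candidate with the
    largest marginal gain until `budget` items are selected."""
    selected = []
    remaining = set(range(len(sim)))
    while remaining and len(selected) < budget:
        current = graph_cut_value(sim, selected, lam)
        best = max(
            sorted(remaining),
            key=lambda j: graph_cut_value(sim, selected + [j], lam) - current,
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy pool (hypothetical): items 0-2 form one cluster, items 3-4 another.
# Similarity is 1.0 on the diagonal, 0.9 within a cluster, 0.1 across clusters.
clusters = [0, 0, 0, 1, 1]
sim = [
    [1.0 if i == j else (0.9 if clusters[i] == clusters[j] else 0.1)
     for j in range(5)]
    for i in range(5)
]
picked = greedy_select(sim, budget=2, lam=1.0)
```

On this toy pool, the first pick comes from the larger cluster (representativeness) and the second from the other cluster (diversity), because the redundancy penalty outweighs the coverage gain of a second item from the same cluster. Lower values of `lam` weaken that penalty and shift the selection toward the densest region of the pool.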