UNICBench: UNIfied Counting Benchmark for MLLM

Chenggang Rong, Tao Han, Zhiyuan Zhao, Yaowu Fan, Jia Wan, Song Guo, Yuan Yuan, Junyu Gao

TL;DR

UNICBench is a unified multimodal, multi-level counting benchmark and evaluation toolkit with accurate ground truth, deterministic numeric parsing, and stratified reporting. Evaluated models perform strongly on some basic counting tasks but show significant gaps on reasoning and the hardest partitions.

Abstract

Counting is a core capability for multimodal large language models (MLLMs), yet there is no unified counting dataset to rigorously evaluate this ability across image, text, and audio. We present UNICBench, a unified multimodal, multi-level counting benchmark and evaluation toolkit with accurate ground truth, deterministic numeric parsing, and stratified reporting. The corpus comprises 5,300 images (5,508 QA), 872 documents (5,888 QA), and 2,069 audio clips (2,905 QA), annotated with a three-level capability taxonomy and difficulty tags. Under a standardized protocol with fixed splits, prompts, and seeds and modality-specific matching rules, we evaluate 45 state-of-the-art MLLMs across modalities. Results show strong performance on some basic counting tasks but significant gaps on reasoning and the hardest partitions, highlighting long-tail errors and substantial headroom for improving general counting. UNICBench offers a rigorous and comparable basis for measurement and a public toolkit to accelerate progress.
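
The abstract mentions deterministic numeric parsing of model responses but does not spell out the rules. The following is a minimal sketch of what such a parser might look like, not the UNICBench implementation: the function name `parse_count`, the "take the last number mentioned" rule, and the small number-word table are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical lookup for spelled-out counts (assumption, not from the paper).
_NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

# Digits with optional thousands separators and an optional decimal part.
_NUM_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_count(response: str) -> Optional[int]:
    """Return the last whole number mentioned in a response, or None.

    The same input always yields the same output, which is the property
    a deterministic parser needs; the specific rule is an assumption.
    """
    matches = _NUM_RE.findall(response)
    if matches:
        value = float(matches[-1].replace(",", ""))
        # Reject non-integer values such as "3.5": a count must be whole.
        return int(value) if value.is_integer() else None
    # Fall back to spelled-out numbers, again preferring the last one.
    for word in reversed(re.findall(r"[a-z]+", response.lower())):
        if word in _NUMBER_WORDS:
            return _NUMBER_WORDS[word]
    return None
```

For example, `parse_count("There are 1,234 birds.")` yields 1234, and a response with no recognizable number yields `None`, which an evaluator could score as a failed answer.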

Paper Structure (33 sections, 4 equations, 32 figures, 7 tables)

Figures (32)

  • Figure 1: Illustration of the benchmark’s task taxonomy and dataset coverage. The left pyramid groups counting problems by hierarchical task level with representative Q/A examples. The right donut chart shows modality coverage and the diverse label categories in each modality, highlighting the benchmark’s broad cross‑modal and semantic coverage for unified counting evaluation.
  • Figure 2: a) Sample visualization in three modalities. b) Overview of dataset composition: sample and question distributions across modalities and the capability/difficulty breakdown. c) Smoothed ground‑truth count distributions for the three modalities, which are skewed and long‑tailed and thus motivate our stratified difficulty thresholds and evaluation protocol. d) Category word cloud based on question counts. e), f), g) Distributions of image resolution, text length, and audio duration, respectively.
  • Figure 3: Overview of the UNICBench pipeline. We standardize multi-modal datasets and define a unified QA-evidence schema with difficulty and capability labels. The end-to-end framework enables assessment of MLLM counting across image, text, and audio.
  • Figure 4: Distribution of prediction error on the image modality. Whiskers and outliers indicate extreme failures: long whiskers or many outliers show that a model makes severe errors on some samples (e.g., rare classes, label noise, or collapse cases). Such frequent extreme errors can substantially increase MAE/MSE even when the median error looks small. The model ordering is consistent with that in Table \ref{tab:image_final_metrics_sci_overall}.
  • Figure 5: Prediction vs. ground truth scatter. Each point is one sample. The upper‑right green box highlights a small set of extreme high‑count samples where the thinking mode outperformed the non‑thinking mode, which drives most of the overall MAE gap.
  • ...and 27 more figures
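
The Figure 4 caption notes that a few extreme errors can inflate MAE/MSE even when the median error looks small. A short sketch of that effect follows. The helper name `error_summary` and the toy predictions are made up for illustration and are not results from the paper.

```python
from statistics import mean, median
from typing import Dict, Sequence


def error_summary(preds: Sequence[float], gts: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, MSE, and median absolute error over paired counts."""
    abs_err = [abs(p - g) for p, g in zip(preds, gts)]
    return {
        "mae": mean(abs_err),
        "mse": mean([e * e for e in abs_err]),
        "median_ae": median(abs_err),
    }


# Toy data (invented): four near-perfect predictions and one severe
# undercount on a high-count sample.
gts = [3, 5, 2, 4, 1000]
preds = [3, 4, 2, 5, 200]
summary = error_summary(preds, gts)
# The median absolute error is 1, while the MAE is 160.4 and the MSE is
# 128000.4, both dominated by the single outlier.
```

Reporting a robust statistic such as the median alongside MAE/MSE, or stratifying by count range, makes this kind of long-tail failure visible instead of letting it blur into a single average.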