NUMINA: A Natural Understanding Benchmark for Multi-dimensional Intelligence and Numerical Reasoning Abilities
Changyu Zeng, Yifan Wang, Zimu Wang, Wei Wang, Zhengni Yang, Muyi Bao, Jiming Xiao, Anh Nguyen, Yutao Yue
TL;DR
NUMINA introduces a comprehensive 3D multimodal benchmark for multi-dimensional intelligence and numerical reasoning in indoor scenes, filling a gap left by existing 3D vision-language datasets. It combines FV, PM, and NI tasks across 74,526 QA pairs generated via the NUMINA-Flow pipeline, which uses LLM rewriting and rule-based verification to ensure grounding and diversity. Experimental results under the Chat-Scene framework show that current open-source LLMs struggle with precise 3D numerical reasoning, particularly for distance and volume estimation, highlighting fundamental architectural gaps in geometric understanding. The work demonstrates the need for new geometric reasoning modules and grounded 3D supervision to enable robust multimodal reasoning, with NUMINA providing a resource and framework for future advancements in indoor spatial AI.
Abstract
Recent advances in 2D multimodal large language models (MLLMs) have significantly improved performance on vision-language tasks. However, extending these capabilities to 3D environments remains a distinct challenge because of the complexity of spatial reasoning. Moreover, existing 3D benchmarks often lack fine-grained annotations for numerical reasoning tasks, which limits MLLMs' ability to perform precise spatial measurements and complex numerical reasoning. To address this gap, we introduce NUMINA, the first Natural Understanding benchmark for Multi-dimensional Intelligence and Numerical reasoning Abilities, designed to enhance multimodal indoor perceptual understanding. NUMINA features multi-scale annotations and diverse question-answer pairs generated with NUMINA-Flow, an automated annotation pipeline that integrates LLM rewriting and rule-based self-verification. We evaluate various state-of-the-art LLMs on NUMINA following the Chat-Scene framework. The results show that current LLMs struggle with multimodal numerical reasoning, particularly precise computations such as distance and volume estimation, highlighting the need for further advances in 3D models. The dataset and source code are available at https://github.com/fengshun124/NUMINA.
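To make the kind of numerical question discussed above concrete, the following is a minimal illustrative sketch, not the NUMINA-Flow implementation. It assumes objects are given as axis-aligned boxes in the hypothetical layout (cx, cy, cz, dx, dy, dz) in meters, computes a centroid distance and a box volume as ground truth, and applies a simple rule-based check that an LLM-rewritten answer still contains the expected number. The function names, the box layout, and the 1% tolerance are all assumptions made for illustration.

```python
import math
import re

# Hypothetical box layout (an assumption, not the dataset's format):
# (cx, cy, cz, dx, dy, dz) = center coordinates and edge lengths in meters.

def centroid_distance(box_a, box_b):
    """Euclidean distance between the centers of two boxes."""
    return math.dist(box_a[:3], box_b[:3])

def box_volume(box):
    """Volume of an axis-aligned box from its edge lengths."""
    _, _, _, dx, dy, dz = box
    return dx * dy * dz

def answer_matches(answer_text, expected, rel_tol=0.01):
    """Rule-based check: the answer must contain a number close to `expected`."""
    numbers = [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", answer_text)]
    return any(math.isclose(n, expected, rel_tol=rel_tol) for n in numbers)

# Toy scene with two objects (made-up values).
chair = (0.0, 0.0, 0.5, 0.5, 0.5, 1.0)
table = (3.0, 4.0, 0.5, 1.0, 2.0, 0.8)

d = centroid_distance(chair, table)
rewritten = "The chair is about 5.0 meters away from the table."
print(answer_matches(rewritten, d))           # rewritten answer keeps the value
print(answer_matches("About 7 meters.", d))   # a drifted value is rejected
```

A check of this form only guards against a rewrite changing the number; it says nothing about whether the question itself is well grounded in the scene, which would need separate rules.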
