NUMINA: A Natural Understanding Benchmark for Multi-dimensional Intelligence and Numerical Reasoning Abilities

Changyu Zeng, Yifan Wang, Zimu Wang, Wei Wang, Zhengni Yang, Muyi Bao, Jiming Xiao, Anh Nguyen, Yutao Yue

TL;DR

NUMINA introduces a 3D multimodal benchmark for multi-dimensional intelligence and numerical reasoning in indoor scenes, filling a gap left by existing 3D vision-language datasets. It combines Fact Validation (FV), Prompt Matching (PM), and Numerical Inference (NI) tasks across 74,526 QA pairs generated via the NUMINA-Flow pipeline, which uses LLM rewriting and rule-based verification to ensure grounding and diversity. Experimental results under the Chat-Scene framework show that current open-source LLMs struggle with precise 3D numerical reasoning, particularly distance and volume estimation, which points to architectural gaps in geometric understanding. The work argues for new geometric reasoning modules and grounded 3D supervision to enable robust multimodal reasoning, with NUMINA serving as a resource and framework for future work on indoor spatial AI.

Abstract

Recent advancements in 2D multimodal large language models (MLLMs) have significantly improved performance in vision-language tasks. However, extending these capabilities to 3D environments remains a distinct challenge due to the complexity of spatial reasoning. Moreover, existing 3D benchmarks often lack fine-grained numerical reasoning task annotations, limiting MLLMs' ability to perform precise spatial measurements and complex numerical reasoning. To address this gap, we introduce NUMINA, the first Natural Understanding benchmark for Multi-dimensional Intelligence and Numerical reasoning Abilities to enhance multimodal indoor perceptual understanding. NUMINA features multi-scale annotations and various question-answer pairs, generated using NUMINA-Flow, an automated annotation pipeline that integrates LLM rewriting and rule-based self-verification. We evaluate the performance of various state-of-the-art LLMs on NUMINA following the Chat-Scene framework, demonstrating that current LLMs struggle with multimodal numerical reasoning, particularly in performing precise computations such as distance and volume estimation, highlighting the need for further advancements in 3D models. The dataset and source code can be obtained from https://github.com/fengshun124/NUMINA.

Paper Structure

This paper contains 29 sections, 1 equation, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Statistics of the NUMINA benchmark. NUMINA is composed of non-numerical and numerical questions, where the latter are further divided into three categories with increasing difficulty: Fact Validation (FV), Prompt Matching (PM), and Numerical Inference (NI).
  • Figure 2: Overview of the NUMINA-Flow pipeline. Numerical Ground Truth (NGT) is extracted from ScanNet, including instance details and pairwise distances. GPT-4o generates diverse question templates filled with NGT, followed by rule-based and manual validation. Non-numerical questions are rewritten using Qwen2.5-72B with the ScanQA dataset for added diversity.
  • Figure 3: Examples of various scene understanding and numerical reasoning tasks in the NUMINA dataset. All tasks are formulated as single-turn question-answering pairs without the use of additional task-specific heads, ensuring a unified and consistent evaluation framework.
  • Figure 4: Overall Chat-Scene framework. The framework processes 3D scenes through a multi-stage pipeline: (1) scene decomposition into object segments, (2) mapping of segments to multi-view images via corresponding masks, (3) extraction of object-centric representations using dedicated 3D and 2D encoders, and (4) combination of these representations with object identifiers to generate scene embeddings as sequences of object-level embeddings. These embeddings serve as input to the language model component. For evaluation on the NUMINA benchmark, we substitute the original large language model with open-source alternatives including Vicuna, Mistral, Qwen, and Phi to assess their numerical reasoning capabilities.
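
The Figure 2 caption describes extracting Numerical Ground Truth (instance details and pairwise distances), filling question templates with it, and applying rule-based validation. The following is a minimal sketch of that idea, not the authors' implementation. It assumes axis-aligned bounding boxes in metres, centroid-to-centroid distance, and a two-decimal answer format; the names `Instance`, `fill_template`, and `verify` are hypothetical and do not come from NUMINA-Flow.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical object record; NUMINA's real schema may differ."""
    label: str
    center: tuple[float, float, float]  # box centroid, assumed in metres
    size: tuple[float, float, float]    # axis-aligned box extents, assumed in metres


def pairwise_distance(a: Instance, b: Instance) -> float:
    # Assumption: distance is measured between box centroids.
    return math.dist(a.center, b.center)


def box_volume(inst: Instance) -> float:
    # Volume of the axis-aligned bounding box (an upper bound on object volume).
    w, d, h = inst.size
    return w * d * h


def fill_template(template: str, a: Instance, b: Instance) -> dict:
    # Fill a question template with labels and attach the numerical ground truth.
    dist = pairwise_distance(a, b)
    return {
        "question": template.format(a=a.label, b=b.label),
        "answer": f"{dist:.2f}",
        "ngt": dist,
    }


def verify(qa: dict, tol: float = 0.005) -> bool:
    # Rule-based check: the answer must be a plain decimal number that
    # matches the ground truth up to the two-decimal rounding error.
    match = re.fullmatch(r"-?\d+(\.\d+)?", qa["answer"])
    if match is None:
        return False
    return abs(float(match.group()) - qa["ngt"]) <= tol + 1e-9


chair = Instance("chair", (0.0, 0.0, 0.0), (1.0, 2.0, 0.5))
table = Instance("table", (3.0, 4.0, 0.0), (1.5, 1.0, 0.8))
qa = fill_template("How far is the {a} from the {b}, in metres?", chair, table)
```

A check of this kind catches answers that drift from the ground truth after rewriting, for example when an LLM paraphrase alters a number. It does not check that the question wording itself is still faithful, which is presumably why the caption also mentions manual validation.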