SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models

Youngjoon Yu, Sangyun Chung, Byung-Kwan Lee, Yong Man Ro

TL;DR

SPARK addresses a gap in LVLM evaluation for multi-vision sensors by introducing a large-scale, sensor-grounded vision-language benchmark. It systematically tests perception and sensory reasoning across RGB, thermal, depth, and X-ray inputs, using yes/no and multi-choice formats to enforce grounded understanding. The study collects 6,248 samples from 5 data sources and evaluates 10 LVLMs, revealing notable weaknesses in sensory reasoning, especially when sensor physics must be invoked. An ablation shows prompting with sensor type improves reasoning, underscoring the need for sensor-aware architectures and instruction design for real-world sensor data.

Abstract

Large-scale Vision-Language Models (LVLMs) have advanced significantly with text-aligned vision inputs, making remarkable progress on computer vision tasks by aligning the text modality with vision inputs. There are also efforts to incorporate multi-vision sensors beyond RGB, including thermal, depth, and medical X-ray images. However, we observe that current LVLMs treat images from multi-vision sensors as if they were in the same RGB domain, without considering the physical characteristics of each sensor. They fail to properly convey the fundamental multi-vision sensor information from the dataset and the corresponding contextual knowledge. Consequently, the information from the actual physical environment is not correctly aligned with the text, making it difficult to answer complex sensor-related questions that depend on the physical environment. In this paper, we aim to establish a multi-vision Sensor Perception And Reasoning benchmarK, called SPARK, that can reduce the fundamental multi-vision sensor information gap between images and multi-vision sensors. We generated 6,248 vision-language test samples to investigate multi-vision sensory perception and multi-vision sensory reasoning on physical sensor knowledge proficiency across different formats, covering different types of sensor-related questions. We used these samples to assess ten leading LVLMs. The results showed that most models displayed deficiencies in multi-vision sensory reasoning to varying extents. Code and data are available at https://github.com/top-yun/SPARK

Paper Structure (13 sections, 3 figures, 3 tables)

This paper contains 13 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Comparison of sensory reasoning performance of recent LVLMs across different multi-vision sensors. Note that sensory reasoning performance drops significantly across different multi-vision sensors.
  • Figure 2: In the proposed SPARK, we build the first benchmark for evaluating the abilities of LVLMs in multi-vision sensor understanding, which covers four types of multi-vision perception tasks (Existence, Counting, Position, and General Description) and two types of multi-vision reasoning tasks (Contextual Reasoning and Sensory Reasoning).
  • Figure 3: Distribution of data sources of the SPARK benchmark. The inner ring shows the six core multi-vision sensory tasks, and the outer ring displays the number of samples for each specific task.
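
The yes/no and multi-choice formats and the six task types named in the Figure 2 caption suggest a simple per-sensor, per-task accuracy breakdown. The following is a minimal sketch of such scoring, not the authors' evaluation code: the `Sample` fields, the `normalize` answer-parsing rules, and the `accuracy_by` helper are all hypothetical names introduced here for illustration, and the example questions are made up.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

# Task names are taken from the Figure 2 caption; everything else is an assumption.
PERCEPTION_TASKS = ("Existence", "Counting", "Position", "General Description")
REASONING_TASKS = ("Contextual Reasoning", "Sensory Reasoning")


@dataclass
class Sample:
    """Hypothetical record layout for one benchmark question."""
    sensor: str         # e.g. "RGB", "thermal", "depth", "X-ray"
    task: str           # one of the six task names above
    question_type: str  # "yes_no" or "multi_choice"
    answer: str         # gold label: "yes"/"no" or an option letter


def normalize(raw: str, question_type: str) -> str:
    """Reduce a free-form model reply to a comparable label ("" if unparseable)."""
    text = raw.strip().lower()
    if question_type == "yes_no":
        match = re.match(r"(yes|no)\b", text)
        return match.group(1) if match else ""
    # Multi-choice: accept a leading option letter such as "B", "(b)" or "b."
    match = re.match(r"\(?([a-z])\b", text)
    return match.group(1) if match else ""


def accuracy_by(samples, predictions, key):
    """Accuracy grouped by a Sample attribute such as "sensor" or "task"."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample, raw in zip(samples, predictions):
        group = getattr(sample, key)
        total[group] += 1
        if normalize(raw, sample.question_type) == sample.answer.lower():
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Made-up examples, for illustration only.
samples = [
    Sample("thermal", "Sensory Reasoning", "yes_no", "yes"),
    Sample("thermal", "Counting", "multi_choice", "B"),
    Sample("depth", "Sensory Reasoning", "yes_no", "no"),
]
predictions = ["Yes, the person is warmer.", "(B) two", "Not sure"]

per_sensor = accuracy_by(samples, predictions, "sensor")
per_task = accuracy_by(samples, predictions, "task")
```

Grouping by `sensor` and by `task` separately mirrors the kind of comparison described in the Figure 1 caption, where reasoning performance is contrasted across sensors. Unparseable replies are counted as wrong here, which is one possible design choice among several.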