RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models

Mehdi Moshtaghi, Siavash H. Khajavi, Joni Pajarinen

TL;DR

RGB-Th-Bench addresses a critical gap in vision-language evaluation by introducing a dense, expert-annotated benchmark for RGB-Thermal understanding. The framework uses 14 skill dimensions and 1,624 Yes/No questions across two prompt groups (RGB-Txt and RGB-Th-Txt), paired with two accuracy metrics (QAcc and SAcc) to rigorously assess robustness and adversarial resilience. Evaluations across 19 state-of-the-art VLMs reveal substantial gaps, with performance heavily constrained by RGB-based pretraining and a lack of large-scale thermal-caption data, underscoring the need for thermal-focused data and learning. The work provides a foundation for advancing multimodal research in infrared image analysis and offers a publicly available benchmark and evaluation toolkit to accelerate progress in RGB-thermal understanding and deployment in real-world scenarios.

Abstract

We introduce RGB-Th-Bench, the first benchmark designed to evaluate the ability of Vision-Language Models (VLMs) to comprehend RGB-Thermal image pairs. While VLMs have demonstrated remarkable progress in visual reasoning and multimodal understanding, their evaluation has been predominantly limited to RGB-based benchmarks, leaving a critical gap in assessing their capabilities in infrared vision tasks. Existing visible-infrared datasets are either task-specific or lack the high-quality annotations necessary for rigorous model evaluation. To address these limitations, RGB-Th-Bench provides a comprehensive evaluation framework covering 14 distinct skill dimensions, with a total of 1,600+ expert-annotated Yes/No questions. The benchmark employs two accuracy metrics: a standard question-level accuracy and a stricter skill-level accuracy, which evaluates model robustness across multiple questions within each skill dimension. This design ensures a thorough assessment of model performance, including resilience to adversarial and hallucinated responses. We conduct extensive evaluations on 19 state-of-the-art VLMs, revealing significant performance gaps in RGB-Thermal understanding. Our results show that even the strongest models struggle with thermal image comprehension, with performance heavily constrained by their RGB-based capabilities. Additionally, the lack of large-scale application-specific and expert-annotated thermal-caption-pair datasets in pre-training is an important reason for the observed performance gap. RGB-Th-Bench highlights the urgent need for further advancements in multimodal learning to bridge the gap between visible and thermal image understanding. The dataset is available through this link, and the evaluation code will also be made publicly available.

Paper Structure

This paper contains 16 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Two data samples from RGB-Th-Bench. Due to space limits, only a subset of 16 Q/As across 4 skill dimensions is shown for each sample, along with the responses from 3 selected VLMs.
  • Figure 2: Some RGB-thermal pair samples in residential settings.
  • Figure 3: Some RGB-thermal pair samples in industrial settings.
  • Figure 4: Illustration of models' QAcc scores across all skill dimensions, with 50% being the random baseline performance, and where green represents higher values.
  • Figure 5: Illustration of models' SAcc scores across all skill dimensions, with 6.25% being the random baseline performance, and where green represents higher values.
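
The random baselines quoted in the captions of Figures 4 and 5 can be reproduced with a short sketch. It rests on an assumption not stated explicitly above: that SAcc counts a skill dimension of a sample as correct only when every one of its Yes/No questions is answered correctly, and that each such group holds 4 questions (consistent with the 16 Q/As across 4 skill dimensions in Figure 1, and with 0.5^4 = 6.25%). The record fields `sample`, `skill`, `pred`, and `gold` are hypothetical names chosen for illustration, not the benchmark's actual data format.

```python
from collections import defaultdict


def qacc(records):
    """Question-level accuracy: fraction of questions answered correctly."""
    return sum(r["pred"] == r["gold"] for r in records) / len(records)


def sacc(records):
    """Skill-level accuracy (assumed definition): a (sample, skill) group
    counts as correct only if all of its questions are answered correctly."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["sample"], r["skill"])].append(r["pred"] == r["gold"])
    return sum(all(hits) for hits in groups.values()) / len(groups)


def random_sacc_baseline(questions_per_group, p_correct=0.5):
    """Chance of guessing every question in one group correctly."""
    return p_correct ** questions_per_group


# Toy data: one sample, two skill groups of four questions each.
# The first group is fully correct; the second has a single mistake.
records = (
    [{"sample": 0, "skill": "A", "pred": "Yes", "gold": "Yes"} for _ in range(4)]
    + [{"sample": 0, "skill": "B", "pred": "Yes", "gold": "Yes"} for _ in range(3)]
    + [{"sample": 0, "skill": "B", "pred": "Yes", "gold": "No"}]
)

print(qacc(records))            # 0.875
print(sacc(records))            # 0.5
print(random_sacc_baseline(4))  # 0.0625
```

Under this reading, a single wrong answer costs one question in QAcc but the entire group in SAcc, which is why the stricter metric penalizes inconsistent or hallucinated responses so heavily.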