ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery

Ayush Shrivastava; Kirtan Gangani; Laksh Jain; Mayank Goel; Nipun Batra

TL;DR

ThermEval addresses the gap in evaluating vision-language models on thermal imagery by introducing ThermEval-B, a benchmark of ~55k thermal VQA pairs across seven tasks, and ThermEval-D, a dense per-pixel temperature dataset. It demonstrates that RGB-trained VLMs struggle with temperature-grounded reasoning, colorbar interpretation, and colormap robustness, with language priors often biasing answers. Through zero-shot prompting and a dedicated parser, the study reveals both the capabilities and limitations of current models, and shows that supervised fine-tuning can substantially improve performance yet falls short of reliable real-world thermal understanding. The work argues for domain-grounded pretraining that explicitly incorporates physical sensor modalities to enable true thermal reasoning in multimodal models.

Abstract

Vision language models (VLMs) achieve strong performance on RGB imagery, but they do not generalize to thermal images. Thermal sensing plays a critical role in settings where visible light fails, including nighttime surveillance, search and rescue, autonomous driving, and medical screening. Unlike RGB imagery, thermal images encode physical temperature rather than color or texture, requiring perceptual and reasoning capabilities that existing RGB-centric benchmarks do not evaluate. We introduce ThermEval-B, a structured benchmark of approximately 55,000 thermal visual question answering pairs designed to assess the foundational primitives required for thermal vision language understanding. ThermEval-B integrates public datasets with our newly collected ThermEval-D, the first dataset to provide dense per-pixel temperature maps with semantic body-part annotations across diverse indoor and outdoor environments. Evaluating 25 open-source and closed-source VLMs, we find that models consistently fail at temperature-grounded reasoning, degrade under colormap transformations, and default to language priors or fixed responses, with only marginal gains from prompting or supervised fine-tuning. These results demonstrate that thermal understanding requires dedicated evaluation beyond RGB-centric assumptions, positioning ThermEval as a benchmark to drive progress in thermal vision language modeling.

Paper Structure (56 sections, 1 equation, 8 figures, 13 tables)

Introduction
Related Work
Thermal and Multi-Spectral Benchmarks
Thermal and Infrared Datasets
False-Colored Thermal Images
ThermEval
ThermEval-B: Benchmark
Benchmark Tasks
Modality Identification (T1 and T2):
Human Presence and Counting (T3):
Inferring the Colorbar (T4):
Thermal Reasoning (T5):
Temperature Extraction (T6 and T7):
ThermEval-D: Dataset
Data Collection Protocol
...and 41 more sections

Figures (8)

Figure 1: Thermal imagery enables critical perception tasks in settings where RGB fails, but VLMs trained predominantly on RGB exhibit systematic errors when applied to thermal images, driven by modality mismatch and language priors.
Figure 2: ThermEval defines seven evaluation tasks covering modality identification (T1–T2), human counting (T3), colorbar interpretation (T4), thermal reasoning (T5), and temperature estimation (T6–T7), designed to probe complementary aspects of thermal vision language understanding.
Figure 3: Images from the ThermEval-D dataset. The top row shows images with a single person in the scene, whereas the second row shows images with more than one person in the scene. Colorbars were added programmatically during task evaluation.
Figure 4: Sample images from the FLIR-ADAS dataset, which is used for Tasks T1, T2, and T3. The top row shows thermal images and the bottom row shows RGB images for different scenes. See the Benchmark Tasks section for more information on the tasks.
Figure 5: Sample images from the LLVIP dataset, which is used for Tasks T1, T2, and T3. The top row shows thermal images and the bottom row shows RGB images for different scenes. See the Benchmark Tasks section for more information on the tasks.
...and 3 more figures
