IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting

Tao Zhang, Yuyang Hong, Yang Xia, Kun Ding, Zeyu Zhang, Ying Wang, Shiming Xiang, Chunhong Pan

TL;DR

IF-Bench is a dedicated benchmark for infrared-image understanding in multimodal LLMs (MLLMs), and its evaluation reveals how model scale, architecture, and thinking paradigms influence infrared comprehension. The paper also presents GenViP, a training-free generative visual prompting approach that translates infrared images to RGB and combines the two images as dual inputs to mitigate domain shift, achieving consistent gains across more than 40 MLLMs. GenViP yields substantial improvements, especially for smaller models, and open-source editing models can be competitive with closed-source ones after targeted tuning. With publicly available benchmark data and code, the work advances infrared multimodal understanding without requiring infrared–text pairs or fine-tuning for each model.

Abstract

Recent advances in multimodal large language models (MLLMs) have led to impressive progress across various benchmarks. However, their capability in understanding infrared images remains unexplored. To address this gap, we introduce IF-Bench, the first high-quality benchmark designed for evaluating multimodal understanding of infrared images. IF-Bench consists of 499 images sourced from 23 infrared datasets and 680 carefully curated visual question-answer pairs, covering 10 essential dimensions of image understanding. Based on this benchmark, we systematically evaluate over 40 open-source and closed-source MLLMs, employing cyclic evaluation, bilingual assessment, and hybrid judgment strategies to enhance the reliability of the results. Our analysis reveals how model scale, architecture, and inference paradigms affect infrared image comprehension, providing valuable insights for this area. Furthermore, we propose a training-free generative visual prompting (GenViP) method, which leverages advanced image editing models to translate infrared images into semantically and spatially aligned RGB counterparts, thereby mitigating domain distribution shifts. Extensive experiments demonstrate that our method consistently yields significant performance improvements across a wide range of MLLMs. The benchmark and code are available at https://github.com/casiatao/IF-Bench.
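
The abstract names cyclic evaluation without defining it. The sketch below illustrates one common reading, stated here as an assumption rather than the authors' protocol: for a multiple-choice question, the answer options are cyclically shifted, the model is queried once per shift, and the question counts as correct only if every shifted variant is answered correctly. The names `rotations`, `cyclic_correct`, and `ask_model` are hypothetical and do not come from the IF-Bench code.

```python
from typing import Callable, List, Sequence

LETTERS = "ABCDEFGH"  # option labels; assumes at most 8 options per question


def rotations(options: Sequence[str]) -> List[List[str]]:
    """Return every cyclic shift of the option list, starting with the original order."""
    n = len(options)
    return [list(options[k:]) + list(options[:k]) for k in range(n)]


def cyclic_correct(
    question: str,
    options: Sequence[str],
    answer: str,
    ask_model: Callable[[str], str],
) -> bool:
    """Assumed cyclic-evaluation rule: correct only if right under every shift.

    `ask_model` is a hypothetical callable that takes a text prompt and
    returns the model's reply, expected to begin with an option letter.
    """
    for shifted in rotations(options):
        lines = [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(shifted)]
        prompt = question + "\n" + "\n".join(lines)
        predicted = ask_model(prompt).strip().upper()[:1]
        gold = LETTERS[shifted.index(answer)]  # letter of the true answer in this shift
        if predicted != gold:
            return False
    return True
```

Under this reading, a model that always picks the same letter cannot score on a question with more than one option, which is the kind of position bias such a protocol is meant to suppress.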

Paper Structure

This paper contains 18 sections, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Distribution of questions across dimensions in IF-Bench.
  • Figure 2: The performance of GenViP on IF-Bench.
  • Figure 3: Construction pipeline and evaluation protocol of IF-Bench.
  • Figure 4: The illustration of GenViP.
  • Figure 5: The average performance change after using thinking.
  • ...and 10 more figures