MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection

Taeheon Kim, Sangyun Chung, Damin Yeom, Youngjoon Yu, Hak Gu Kim, Yong Man Ro

TL;DR

This work addresses modality bias in multispectral pedestrian detection by introducing MSCoTDet, a framework that combines language-driven reasoning with vision-based detectors. It uses Multispectral Chain-of-Thought (MSCoT) prompting to generate rationale-based detection scores from text descriptions of RGB and thermal images, and fuses them with vision-based detections through a Language-driven Multi-modal Fusion (LMF) module. The approach includes a preprocessing step (Detection Pairing (DPair), Visual Crop & Mark (VCM), and CPDG) that aligns cross-modal detections, and a two-stage prompting strategy that first yields uni-modal scores and then cross-modal fused scores. Empirical results on FLIR, CVC-14, and ROTX-MP show that MSCoTDet achieves state-of-the-art AP and MR and generalizes well under distribution shifts. Ablations examine the contributions of language integration, the fusion strategy, and the choice of LLM, and the method remains competitive in efficiency. Overall, MSCoTDet offers a way to mitigate modality bias in multispectral perception, which is relevant to safety-critical systems that need all-day pedestrian detection.

Abstract

Multispectral pedestrian detection is attractive for around-the-clock applications due to the complementary information between RGB and thermal modalities. However, current models often fail to detect pedestrians in certain cases (e.g., thermal-obscured pedestrians), particularly due to the modality bias learned from statistically biased datasets. In this paper, we investigate how to mitigate modality bias in multispectral pedestrian detection using Large Language Models (LLMs). Accordingly, we design a Multispectral Chain-of-Thought (MSCoT) prompting strategy, which prompts the LLM to perform multispectral pedestrian detection. Moreover, we propose a novel Multispectral Chain-of-Thought Detection (MSCoTDet) framework that integrates MSCoT prompting into multispectral pedestrian detection. To this end, we design a Language-driven Multi-modal Fusion (LMF) strategy that enables fusing the outputs of MSCoT prompting with the detection results of vision-based multispectral pedestrian detection models. Extensive experiments validate that MSCoTDet effectively mitigates modality biases and improves multispectral pedestrian detection.

Paper Structure (27 sections, 12 equations, 6 figures, 6 tables, 4 algorithms)

Figures (6)

  • Figure 1: Problem illustration and our motivation. (a) In multispectral pedestrian datasets, thermal signatures always appear on pedestrians, as the thermal modality can generally capture pedestrians all day/night. In these datasets, thermal-obscured data is underrepresented. Models trained on such datasets learn the statistical co-occurrences between pedestrians and their thermal signatures. (b) As a result, models fail to detect pedestrians in thermal-obscured data, even though they are clearly visible in RGB. (c) An example of prompting the LLM. Based on the textual descriptions of RGB and thermal images, we prompt ChatGPT to answer the question "Based on these descriptions, what is in these images?". ChatGPT detects the person without suffering from modality bias, recognizing that a person wearing heat-insulation clothing is invisible in thermal images due to the reflective material. Our motivation is to develop MSCoT prompting based on LLMs and integrate it into vision-based multispectral pedestrian detectors.
  • Figure 2: Comparison between previous works and our method (MSCoTDet). (a) Previous approaches mainly focused on mid-fusion methods, e.g., fusing features internally in the network. (b) A few works use late fusion, ensembling detections from independently trained single-modal detectors, i.e., RGB and thermal detectors. (c) MSCoTDet (Ours) adds a language branch that performs detection using Large Language Models (LLMs). The language branch includes MSCoT prompting, which prompts the LLM to perform multispectral pedestrian detection. Our proposed Language-driven Multi-modal Fusion (LMF) then fuses the vision-driven and language-driven detections.
  • Figure 3: Overall architecture of the proposed Multispectral Chain-of-Thought Detection (MSCoTDet) framework, including the vision branch, the language branch, and Language-driven Multi-modal Fusion.
  • Figure 4: Visualized details of (a) the Detection Pairing Module and (b) the Visual Crop & Mark Module. (a) The Detection Pairing Module takes single-modal detections from the vision branch and finds the detection pairs that belong to the same pedestrian, e.g., $b_{1}^{paired} = (b_{1,RGB}^{paired}, b_{1,T}^{paired})$ and $s_{1}^{paired} = (s_{1,RGB}^{paired}, s_{1,T}^{paired})$. By iterating over pedestrians, the module produces the sets of paired detections $B^{paired}$ and $S^{paired}$. (b) The Visual Crop & Mark Module takes $B^{paired}$ as input and outputs the pre-processed images $X_{RGB}^{VCM}$ and $X_{T}^{VCM}$.
  • Figure 5: Visualized process of our proposed Multispectral Chain-of-Thought (MSCoT) prompting.
  • ...and 1 more figures
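
The Detection Pairing Module described in the Figure 4(a) caption matches RGB and thermal detections that belong to the same pedestrian. The caption does not state the matching criterion, so the following is only a minimal sketch under assumptions: it pairs boxes greedily by intersection-over-union (IoU), with a hypothetical threshold `iou_thresh`, and simply drops unpaired detections. The function names `iou` and `pair_detections` are illustrative and not taken from the paper.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pair_detections(
    rgb_boxes: List[Box],
    rgb_scores: List[float],
    t_boxes: List[Box],
    t_scores: List[float],
    iou_thresh: float = 0.5,  # assumed value, not from the paper
) -> Tuple[List[Tuple[Box, Box]], List[Tuple[float, float]]]:
    """Greedily pair RGB and thermal detections by descending IoU.

    Returns B_paired as (b_RGB, b_T) tuples and S_paired as (s_RGB, s_T) tuples.
    Detections without a partner above the threshold are dropped here;
    this is a simplification made for the sketch.
    """
    candidates = sorted(
        (
            (iou(rb, tb), i, j)
            for i, rb in enumerate(rgb_boxes)
            for j, tb in enumerate(t_boxes)
        ),
        reverse=True,
    )
    used_rgb, used_t = set(), set()
    b_paired, s_paired = [], []
    for overlap, i, j in candidates:
        if overlap < iou_thresh:
            break  # remaining candidates overlap even less
        if i in used_rgb or j in used_t:
            continue  # each detection joins at most one pair
        used_rgb.add(i)
        used_t.add(j)
        b_paired.append((rgb_boxes[i], t_boxes[j]))
        s_paired.append((rgb_scores[i], t_scores[j]))
    return b_paired, s_paired
```

Greedy matching is chosen here only for brevity; an optimal one-to-one assignment (for example, the Hungarian algorithm) would be a reasonable alternative under the same assumptions.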