MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection
Taeheon Kim, Sangyun Chung, Damin Yeom, Youngjoon Yu, Hak Gu Kim, Yong Man Ro
TL;DR
This work addresses modality bias in multispectral pedestrian detection by introducing MSCoTDet, a framework that combines language-driven reasoning with vision-based detectors. It uses Multispectral Chain-of-Thought (MSCoT) prompting to generate calibrated, rationale-based detection scores from text descriptions of the RGB and thermal images, then fuses those scores with vision-based detections through a Language-driven Multi-modal Fusion (LMF) module. A preprocessing step (DPair, VCM, CPDG) aligns cross-modal detections, and a two-stage prompting strategy first yields uni-modal scores and then cross-modal fused scores. Experiments on FLIR, CVC-14, and ROTX-MP show that MSCoTDet achieves state-of-the-art average precision (AP) and miss rate (MR) and generalizes well under distribution shifts. Ablations confirm the benefits of language integration, the fusion strategy, and the choice of LLM, while efficiency remains competitive. Overall, MSCoTDet offers a way to mitigate modality bias in multispectral perception tasks, with direct applicability to safety-critical systems.
Abstract
Multispectral pedestrian detection is attractive for around-the-clock applications due to the complementary information between RGB and thermal modalities. However, current models often fail to detect pedestrians in certain cases (e.g., thermal-obscured pedestrians), particularly due to the modality bias learned from statistically biased datasets. In this paper, we investigate how to mitigate modality bias in multispectral pedestrian detection using Large Language Models (LLMs). Accordingly, we design a Multispectral Chain-of-Thought (MSCoT) prompting strategy, which prompts the LLM to perform multispectral pedestrian detection. Moreover, we propose a novel Multispectral Chain-of-Thought Detection (MSCoTDet) framework that integrates MSCoT prompting into multispectral pedestrian detection. To this end, we design a Language-driven Multi-modal Fusion (LMF) strategy that enables fusing the outputs of MSCoT prompting with the detection results of vision-based multispectral pedestrian detection models. Extensive experiments validate that MSCoTDet effectively mitigates modality biases and improves multispectral pedestrian detection.
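To make the idea of fusing language-derived scores with vision-based detection scores concrete, the following is a minimal sketch, not the paper's LMF module. It assumes each detection has already been paired across RGB and thermal, and it substitutes a simple weighted average for the actual fusion rule. The names PairedDetection, fuse_scores, fuse_detections, and the weight parameter are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class PairedDetection:
    """One pedestrian candidate, already matched across RGB and thermal.

    Hypothetical container: the field names are illustrative only.
    """
    box: Box
    vision_score: float    # confidence from a vision-based multispectral detector
    language_score: float  # confidence derived from the LLM prompting step


def fuse_scores(vision_score: float, language_score: float, weight: float = 0.5) -> float:
    """Blend the two scores with a convex combination.

    ASSUMPTION: a weighted average stands in for the paper's fusion rule,
    which is not specified in the abstract.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * language_score + (1.0 - weight) * vision_score


def fuse_detections(dets: List[PairedDetection], weight: float = 0.5) -> List[Tuple[Box, float]]:
    """Return (box, fused_score) for every paired detection."""
    return [(d.box, fuse_scores(d.vision_score, d.language_score, weight)) for d in dets]


# Illustrative values only: a pedestrian that the vision model scores low
# (e.g. obscured in thermal) but the language branch scores high.
example = PairedDetection(box=(10.0, 20.0, 50.0, 120.0), vision_score=0.2, language_score=0.8)
fused = fuse_detections([example], weight=0.5)
```

In this toy setting, a higher weight lets the language branch pull up the score of a detection that the vision model under-rates, which is the kind of case the abstract attributes to modality bias.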
