MS-DETR: Multispectral Pedestrian Detection Transformer with Loosely Coupled Fusion and Modality-Balanced Optimization

Yinghui Xing; Shuo Yang; Song Wang; Shizhou Zhang; Guoqiang Liang; Xiuwei Zhang; Yanning Zhang

TL;DR

This paper proposes MS-DETR, a multispectral pedestrian detection Transformer that combines a loosely coupled fusion strategy with an instance-aware modality-balanced optimization strategy. The latter preserves visible and thermal decoder branches and aligns their predicted slots through an instance-wise dynamic loss. MS-DETR shows superior performance on the challenging KAIST, CVC-14 and LLVIP benchmark datasets.

Abstract

Multispectral pedestrian detection is an important task for many around-the-clock applications, since the visible and thermal modalities provide complementary information, especially under low-light conditions. Because two modalities are involved, misalignment and modality imbalance are the most significant issues in multispectral pedestrian detection. In this paper, we propose the MultiSpectral pedestrian DEtection TRansformer (MS-DETR) to address these issues. MS-DETR consists of two modality-specific backbones and Transformer encoders, followed by a multi-modal Transformer decoder in which the visible and thermal features are fused. To resist the misalignment between multi-modal images, we design a loosely coupled fusion strategy that sparsely samples keypoints from the multi-modal features independently and fuses them with adaptively learned attention weights. Moreover, based on the insight that not only different modalities but also different pedestrian instances tend to contribute to the final detection with different confidence scores, we further propose an instance-aware modality-balanced optimization strategy, which preserves visible and thermal decoder branches and aligns their predicted slots through an instance-wise dynamic loss. Our end-to-end MS-DETR shows superior performance on the challenging KAIST, CVC-14 and LLVIP benchmark datasets. The source code is available at https://github.com/YinghuiXing/MS-DETR.

Paper Structure (17 sections, 12 equations, 11 figures, 9 tables)

Introduction
Related Works
Multispectral Pedestrian Detection
Object Detection with Transformer
Proposed Method
Overview
Multi-Modal Transformer Decoder
Instance-aware Modality-Balanced Optimization
Inference process
Experiments
Datasets and Evaluation Metric
Implementation Details
Comparisons with State-of-the-art
Ablation Studies
Visualizations of Loosely Coupled Fusion
...and 2 more sections

Figures (11)

Figure 1: Different fusion strategies for multispectral pedestrian detection. Densely coupled fusion, e.g., concatenation and addition, tends to drift from the target on misaligned image pairs. In contrast, our loosely coupled fusion aggregates sampled keypoints, which makes it robust to misalignment.
Figure 2: Overall architecture of our MS-DETR, where two modality-specific CNN backbones and two modality-specific Transformer encoders extract features from the visible and thermal images. The multi-modal Transformer decoder takes feature maps, positional encodings (PE) and modality-specific content embeddings (CE) as inputs to generate three sets of prediction slots. V, F and T stand for visible, fusion and thermal.
Figure 3: The multi-modal Transformer decoder of MS-DETR. It has three branches, i.e., visible (V), fusion (F) and thermal (T), where PE, CE, SA, CA and FFN stand for Positional Encodings, Content Embeddings, Self-Attention, Cross-Attention and Feed-Forward Network. Dashed lines indicate shared parameters.
Figure 4: Details of the multi-modal cross-attention (CA) module. The positional encodings pass through a linear layer to predict reference points, which are then combined with the content embeddings of the fusion branch to predict offsets and their corresponding attention weights. Given a group of reference points and offsets, two groups of feature points are sampled from the multi-modal multi-scale features. These sampled feature points are fused by a weighted sum.
Figure 5: Examples of the position shift problem in the CVC-14 dataset, where the pedestrians in (a) are grossly misaligned spatially and the pedestrians in (b) are unpaired, i.e., their number differs between the two modalities.
...and 6 more figures
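
The fusion step described in the abstract and in the Figure 4 caption (sample keypoints from each modality independently, then fuse them by a weighted sum) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, a single feature scale is used instead of multi-scale features, and the offsets and attention logits are passed in as plain arguments, whereas the caption says they are predicted from the positional encodings and content embeddings. Normalizing the attention weights jointly over both modalities with a softmax is also an assumption.

```python
import math


def bilinear_sample(feat, x, y):
    """Bilinearly interpolate an H x W x C feature map (nested lists) at pixel (x, y)."""
    h, w = len(feat), len(feat[0])
    # Clamp so that out-of-range sampling locations fall back to the border.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    return [
        (1 - dy) * ((1 - dx) * a + dx * b) + dy * ((1 - dx) * c + dx * d)
        for a, b, c, d in zip(feat[y0][x0], feat[y0][x1], feat[y1][x0], feat[y1][x1])
    ]


def softmax(logits):
    """Numerically stable softmax over a flat list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loosely_coupled_fusion(vis_feat, thr_feat, ref_point, vis_offsets, thr_offsets, logits):
    """Fuse K keypoints per modality around one reference point (hypothetical helper).

    vis_offsets / thr_offsets: K (dx, dy) pairs per modality, chosen independently,
    so the two modalities need not be sampled at the same locations.
    logits: 2K unnormalized attention scores (visible first, then thermal).
    """
    rx, ry = ref_point
    samples = [bilinear_sample(vis_feat, rx + dx, ry + dy) for dx, dy in vis_offsets]
    samples += [bilinear_sample(thr_feat, rx + dx, ry + dy) for dx, dy in thr_offsets]
    weights = softmax(logits)  # assumption: normalized jointly over both modalities
    channels = len(samples[0])
    return [sum(w * s[c] for w, s in zip(weights, samples)) for c in range(channels)]


# Toy 2 x 2 single-channel feature maps for the two modalities.
vis = [[[0.0], [1.0]], [[2.0], [3.0]]]
thr = [[[10.0], [10.0]], [[10.0], [10.0]]]
fused = loosely_coupled_fusion(
    vis, thr, ref_point=(0.5, 0.5),
    vis_offsets=[(0.0, 0.0)], thr_offsets=[(0.25, 0.0)], logits=[0.0, 0.0],
)
print(fused)  # [5.75]
```

Because each modality has its own offsets, a pedestrian that is shifted between the visible and thermal images can still be sampled at the right place in both, which is the intuition behind calling the fusion loosely coupled.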
