AMFD: Distillation via Adaptive Multimodal Fusion for Multispectral Pedestrian Detection

Zizhao Chen, Yeqiang Qian, Xiaoxiao Yang, Chunxiang Wang, Ming Yang

TL;DR

This work addresses the high inference cost of multispectral pedestrian detection by introducing Adaptive Modal Fusion Distillation (AMFD), which distills original RGB and TIR features into a lightweight single-stream student via a fusion distillation architecture. A pair of Modal Extraction Alignment (MEA) modules, incorporating global and focal attention, guides the student to learn adaptive fusion strategies without relying on the teacher's fusion module. The authors also release the SJTU Multispectral Object Detection (SMOD) dataset and demonstrate across KAIST, LLVIP, and SMOD that AMFD improves detection metrics (MR^{-2} and mAP) while significantly reducing inference time, enabling practical deployment on embedded devices. The approach offers a flexible, hardware-friendly path to efficient multispectral perception in autonomous systems and related applications.

Abstract

Multispectral pedestrian detection has been shown to improve detection performance in complex illumination scenarios. However, prevalent double-stream networks in multispectral detection employ two separate feature extraction branches for multi-modal data, leading to nearly double the inference time of single-stream networks, which use only one feature extraction branch. This increased inference time has hindered the widespread deployment of multispectral pedestrian detection on embedded devices for autonomous systems. To address this limitation, various knowledge distillation methods have been proposed. However, traditional distillation methods focus only on the fusion features and ignore the large amount of information in the original multi-modal features, thereby restricting the student network's performance. To tackle this challenge, we introduce the Adaptive Modal Fusion Distillation (AMFD) framework, which can fully utilize the original modal features of the teacher network. Specifically, a Modal Extraction Alignment (MEA) module, which integrates focal and global attention mechanisms, is used to derive learning weights for the student network. This enables the student network to acquire an optimal fusion strategy that is independent of the teacher network's, without requiring an additional feature fusion module. Furthermore, we present the SMOD dataset, a well-aligned, challenging multispectral dataset for detection. Extensive experiments on the challenging KAIST, LLVIP and SMOD datasets are conducted to validate the effectiveness of AMFD. The results demonstrate that our method outperforms existing state-of-the-art methods in both reducing log-average Miss Rate and improving mean Average Precision. The code is available at https://github.com/bigD233/AMFD.git.
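
The log-average Miss Rate (MR^{-2}) mentioned above is commonly computed by sampling the miss rate at nine false-positives-per-image (FPPI) reference points evenly spaced in log space over [10^{-2}, 10^{0}] and taking their geometric mean. The sketch below follows that common convention; the function name and input format are illustrative assumptions, not taken from the AMFD code base, and the authors' evaluation script may differ in its details.

```python
import bisect
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of the miss rate sampled at log-spaced FPPI points.

    fppi:      FPPI values sorted in increasing order.
    miss_rate: miss rate at each FPPI value (same length as fppi).
    Illustrative helper; not the authors' evaluation code.
    """
    if len(fppi) != len(miss_rate):
        raise ValueError("fppi and miss_rate must have the same length")

    # Reference points evenly spaced in log space between 1e-2 and 1e0.
    refs = [10.0 ** (-2.0 + 2.0 * i / (num_points - 1)) for i in range(num_points)]

    sampled = []
    for ref in refs:
        # Index of the last curve point whose FPPI does not exceed the reference.
        idx = bisect.bisect_right(fppi, ref) - 1
        # If the curve has no point at or below this FPPI, count a full miss.
        sampled.append(miss_rate[idx] if idx >= 0 else 1.0)

    # Geometric mean; clamp to avoid log(0) for a perfect detector.
    return math.exp(sum(math.log(max(m, 1e-10)) for m in sampled) / num_points)


if __name__ == "__main__":
    # Toy curve: miss rate drops from 0.6 to 0.2 as more false positives are allowed.
    print(log_average_miss_rate([0.005, 0.5], [0.6, 0.2]))
```

Lower values are better: a detector that keeps a low miss rate even when few false positives are allowed scores well on this metric.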

Paper Structure (23 sections, 12 equations, 12 figures, 10 tables)

Figures (12)

  • Figure 1: The experimental results of pedestrian detection on the KAIST dataset [hwang2015multispectral]. The teacher is a two-stream network with a complex fusion module and a ResNet50 backbone [2016Deep]. The students are single-stream networks (Faster-RCNN [2017Faster] and RetinaNet [lin2017focal]) with simple image-level fusion and a ResNet18 backbone [2016Deep].
  • Figure 2: Spatial attention obtained from the fusion feature maps of different networks. (c) is similar to (b), but the noise in the black box is not suppressed; this noise makes the pedestrians in the two red boxes in (a) hard to recognize in (c). Our method, shown in (d), no longer follows the teacher's fusion strategy and represents the pedestrian features well.
  • Figure 3: The overall architecture of the proposed distillation framework. First, we use a frozen two-stream network with a complex feature extractor as the teacher network. The student network is a single-stream network with a simple image-level fusion module. During training, the framework distills the knowledge in the teacher network's RGB and TIR features into the student network's fusion features through the fusion distillation architecture. This architecture contains two Modal Extraction Alignment (MEA) modules, which adaptively extract the difference between the student network's fusion feature and the teacher network's RGB and TIR features. Finally, the student network is optimized with the two MEA losses produced by the MEA modules and the original detection loss.
  • Figure 4: Comparison between the fusion distillation architecture and the traditional architecture. The feature used for distillation is moved earlier, from the fusion feature to the original modal features.
  • Figure 5: The structure of our Modal Extraction Alignment (MEA) module. (a) shows the overall structure of the MEA module. Both input features go through two modules, Global Feature Extraction (GE) and Focal Feature Extraction (FE), which together produce an MEA loss consisting of a global loss and a focal loss. (b) and (c) show the details of GE and FE. GE generates a $C \times 1 \times 1$ weight, which is added to the original feature map by broadcast channel-wise addition, aiming to obtain a feature map that better captures global relations.
  • ...and 7 more figures
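
The captions for Figures 3 and 5 describe two mechanics that a small example can make concrete: a $C \times 1 \times 1$ weight added to a $C \times H \times W$ feature map by broadcasting, and a training objective that combines the detection loss with two MEA losses. The dependency-free sketch below illustrates both under stated assumptions. Global average pooling stands in for however GE actually computes its weight, and the function names and the `alpha` and `beta` coefficients are hypothetical; none of this is taken from the authors' implementation.

```python
def global_channel_weights(feat):
    """Return one scalar per channel, i.e. a C x 1 x 1 weight.

    feat is a C x H x W feature map stored as nested lists.
    ASSUMPTION: plain global average pooling stands in for the GE
    weight computation, which the captions do not specify.
    """
    weights = []
    for channel in feat:
        values = [v for row in channel for v in row]
        weights.append(sum(values) / len(values))
    return weights


def broadcast_channel_add(feat, weights):
    """Add each channel's scalar weight to every spatial position.

    Equivalent to `feat + w` with `w` of shape (C, 1, 1) in an array library.
    """
    return [
        [[v + w for v in row] for row in channel]
        for channel, w in zip(feat, weights)
    ]


def total_loss(det_loss, mea_loss_rgb, mea_loss_tir, alpha=1.0, beta=1.0):
    """Detection loss plus the two MEA losses (one per teacher modality).

    alpha and beta are hypothetical balancing weights for illustration.
    """
    return det_loss + alpha * mea_loss_rgb + beta * mea_loss_tir


if __name__ == "__main__":
    # Toy feature map with C=2 channels and a 2 x 2 spatial grid.
    feat = [
        [[1.0, 3.0], [5.0, 7.0]],
        [[0.0, 0.0], [0.0, 4.0]],
    ]
    weights = global_channel_weights(feat)
    print(weights)
    print(broadcast_channel_add(feat, weights))
    print(total_loss(1.0, 0.5, 0.25))
```

Because the weight has one value per channel, the addition shifts every spatial position of a channel by the same amount, which is how a globally pooled summary can be injected back into a feature map without changing its shape.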