DEYOLO: Dual-Feature-Enhancement YOLO for Cross-Modality Object Detection

Yishuo Chen, Boran Wang, Xinyu Guo, Wenbin Zhu, Jiasheng He, Xiaobin Liu, Jing Yuan

TL;DR

DEYOLO addresses cross-modality object detection in low-illumination scenes by fusing RGB and infrared features in the representation space rather than at the image level. It introduces two novel modules, DECA and DEPA, that dual-enhance the semantic and spatial information from both modalities, coupled with a Bi-directional Decoupled Focus module in the backbone that enlarges receptive fields in multiple directions. Empirical results on M$^3$FD and LLVIP show that DEYOLO variants outperform state-of-the-art single-modality detectors and fusion-based detectors, with notable gains in mAP$_{50}$ and mAP$_{50-95}$, while KAIST offers cross-modality generalization evidence. The work provides a practical, plug-and-play framework for improving RGB-IR detection, illustrating the benefits of modality-aware feature-space fusion optimized for detection tasks.

Abstract

Object detection in poor-illumination environments is a challenging task, as objects are usually not clearly visible in RGB images. As infrared images provide additional clear edge information that complements RGB images, fusing RGB and infrared images has the potential to enhance detection ability in poor-illumination environments. However, existing works involving both visible and infrared images focus only on image fusion rather than object detection. Moreover, they directly fuse the two image modalities, which ignores the mutual interference between them. To fuse the two modalities in a way that maximizes the advantages of cross-modality, we design a dual-enhancement-based cross-modality object detection network, DEYOLO, in which semantic-spatial cross-modality modules and a novel bi-directional decoupled focus module are designed to achieve detection-centered mutual enhancement of RGB-infrared (RGB-IR). Specifically, a dual semantic enhancing channel weight assignment module (DECA) and a dual spatial enhancing pixel weight assignment module (DEPA) are first proposed to aggregate cross-modality information in the feature space and improve the feature representation ability, such that feature fusion can aim at the object detection task. Meanwhile, a dual-enhancement mechanism, including enhancements for two-modality fusion and for each single modality, is designed in both DECA and DEPA to reduce interference between the two image modalities. Then, a novel bi-directional decoupled focus is developed to enlarge the receptive field of the backbone network in different directions, which improves the representation quality of DEYOLO. Extensive experiments on M$^3$FD and LLVIP show that our approach outperforms SOTA object detection algorithms by a clear margin. Our code is available at https://github.com/chips96/DEYOLO.
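To make the distinction between channel weight assignment and pixel weight assignment concrete, the following is a minimal NumPy sketch of the general idea only. It is not the authors' DECA or DEPA: the function names (`channel_weights`, `pixel_weights`, `toy_dual_enhance`), the element-wise-sum fusion, the average pooling, and the `1 + weight` residual scaling are all assumptions made for illustration, and the actual modules are learned layers whose definitions are in the paper and the linked repository.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_weights(feat):
    # Global average pool over H and W, then squash: one weight per channel.
    return sigmoid(feat.mean(axis=(1, 2)))  # shape (C,)


def pixel_weights(feat):
    # Average over channels, then squash: one weight per spatial location.
    return sigmoid(feat.mean(axis=0))  # shape (H, W)


def toy_dual_enhance(rgb, ir):
    """Toy two-stage enhancement on (C, H, W) arrays.

    Hypothetical illustration only; NOT the authors' DECA/DEPA.
    """
    fused = rgb + ir  # assumed fusion: element-wise sum
    # Stage 1: channel weights from each modality re-weight the fused features.
    fused = fused * (channel_weights(rgb) * channel_weights(ir))[:, None, None]
    # Stage 2: channel weights from the enhanced fusion reinforce each modality.
    w_c = channel_weights(fused)[:, None, None]
    rgb_c, ir_c = rgb * (1.0 + w_c), ir * (1.0 + w_c)
    # Pixel-level step: each modality is re-weighted by the other's spatial map.
    rgb_out = rgb_c * (1.0 + pixel_weights(ir_c))[None, :, :]
    ir_out = ir_c * (1.0 + pixel_weights(rgb_c))[None, :, :]
    return rgb_out, ir_out


rng = np.random.default_rng(0)
rgb = rng.normal(size=(4, 8, 8))  # stand-in RGB feature map
ir = rng.normal(size=(4, 8, 8))   # stand-in infrared feature map
rgb_out, ir_out = toy_dual_enhance(rgb, ir)
print(rgb_out.shape, ir_out.shape)
```

The channel step scales whole feature maps (which channels matter), while the pixel step scales individual locations (where in the image matters). In this toy version each modality's output keeps its original shape, so it could in principle be passed on to a per-scale detection head unchanged.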

Paper Structure

This paper contains 15 sections, 9 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Detection results of different methods.
  • Figure 2: The framework of the proposed DEYOLO. We incorporate dual-context collaborative enhancement modules (DECA and DEPA) within the feature extraction streams dedicated to each detection head in order to refine the single-modality features and fuse multi-modality representations. Concurrently, the Bi-direction Decoupled Focus is inserted in the early layers of the YOLOv8 backbone to expand the network's receptive fields.
  • Figure 3: The concrete structure of DECA and DEPA. These modules utilize both single-modality and cross-modality information through a dual enhancement mechanism. DECA enhances the cross-modality fusion results by leveraging dependencies between channels within each modality; the outcomes are then used to reinforce the original single-modality features, highlighting more discriminative channels. Similarly, DEPA learns dependency structures within and across modalities to produce enhanced multi-modality representations with stronger positional awareness.
  • Figure 4: Bi-direction decoupled focus.
  • Figure 5: mAP$_{50}$ in specific categories.
  • ...and 1 more figures