Mitigating the Impact of Prominent Position Shift in Drone-based RGBT Object Detection

Yan Zhang, Wen Yang, Chang Xu, Qian Hu, Fang Xu, Gui-Song Xia

TL;DR

This work tackles the prominent position shift problem in drone-based RGBT object detection by treating shifted ground truths (GTs) in the sensed modality as label noise and correcting them on the fly using a Mean Teacher-based Cross-modality Bbox Correction (CBC) framework. It couples CBC with a Shifted Window-based Cascaded Alignment (SWCA) module to address both bounding-box confusion and cross-modal feature map mismatch, enabling more informative multi-modal representations. The approach yields substantial improvements in mAP$_{50}$ on RGBTDronePerson and on a shift subset of DroneVehicle, and the aSim metric indicates better cross-modal alignment of the sensed GTs after correction. Overall, the method lessens the need for separate supervision of the sensed modality and improves the detection of tiny objects in real-world drone imagery. The authors state that code and data will be made publicly available.

Abstract

Drone-based RGBT object detection plays a crucial role in many around-the-clock applications. However, real-world drone-viewed RGBT data suffers from the prominent position shift problem, i.e., the position of a tiny object differs greatly between modalities. For instance, a slight deviation of a tiny object in the thermal modality can cause it to drift away from its own main body in the RGB modality. Because RGBT data are usually labeled on only one modality (the reference), the unlabeled modality (the sensed one) lacks accurate supervision signals, which prevents the detector from learning a good representation. Moreover, the mismatch between corresponding feature points in the two modalities makes the fused features confusing for the detection head. In this paper, we propose to cast the cross-modality box shift issue as a label noise problem and address it on the fly via a novel Mean Teacher-based Cross-modality Box Correction head ensemble (CBC). In this way, the network can learn more informative representations for both modalities. Furthermore, to alleviate the feature map mismatch problem in RGBT fusion, we devise a Shifted Window-Based Cascaded Alignment (SWCA) module. SWCA mines long-range dependencies between the spatially unaligned features inside shifted windows and aligns the sensed features with the reference ones in a cascaded manner. Extensive experiments on two drone-based RGBT object detection datasets demonstrate that the correction results are both visually and quantitatively favorable, thereby improving the detection performance. In particular, our CBC module boosts the precision of the sensed modality ground truth by 25.52 aSim points. Overall, the proposed detector achieves an mAP$_{50}$ of 43.55 points on RGBTDronePerson and surpasses a state-of-the-art method by 8.6 mAP$_{50}$ on a shift subset of the DroneVehicle dataset. The code and data will be made publicly available.
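
To make the position shift problem concrete, the following minimal sketch compares how the same pixel offset affects a tiny box and a large box. It uses plain intersection over union (IoU), which is not the aSim metric reported in the paper; the box sizes and the 6-pixel offset are illustrative assumptions, not values from the datasets.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def shifted_iou(size, shift):
    """IoU between a square box and a copy of it moved `shift` pixels to the right."""
    box = (0.0, 0.0, float(size), float(size))
    moved = (float(shift), 0.0, float(size + shift), float(size))
    return iou(box, moved)


# Hypothetical sizes: an 8x8 "tiny" object and a 64x64 object, same 6-pixel shift.
print(f"tiny  (8x8):   {shifted_iou(8, 6):.3f}")
print(f"large (64x64): {shifted_iou(64, 6):.3f}")
```

Under these assumed numbers the tiny box keeps only about a seventh of its overlap, while the large box still overlaps by more than 80%. This is why a reference-modality label that is merely a few pixels off can act as a noisy label for a tiny object in the sensed modality.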

Paper Structure

This paper contains 17 sections, 11 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: (a) Illustration of prominent position shifts. "Reference" denotes the modality with aligned GTs; "Sensed" denotes the one without GTs. Yellow and green masks indicate the main body of the person and rider objects. (b) Detection results on the visible modality from a detector trained on thermal images. This illustrates the transferability between the RGB and thermal modalities.
  • Figure 2: The overall structure of the proposed method in the training stage. The main branch is an RGBT fusion-and-detection pipeline (bottom). To handle the prominent position shift in the sensed modality, we design a Cross-modality Bbox Correction (CBC) module under a Mean Teacher framework (top). The teacher CBC head takes the sensed feature $F^s$ as input and yields sensed GTs $\{(b^{s*}_{k+1},c^*_{k+1})\}$ via a cross-modality bbox correction strategy. The strategy consists of three steps: bag construction, sample selection, and bbox correction. Finally, the student CBC head is updated by the supervised reference loss $\mathcal{L}^r_{cbc}$ and the "unsupervised" sensed loss $\mathcal{L}^s_{cbc}$. The teacher weights are updated as the exponential moving average (EMA) of the student weights.
  • Figure 3: Two successive SWCA Transformer blocks. OP denotes the offset predictor. The cross-attention mechanism in WCA and SWCA is the same except for the window split.
  • Figure 4: Detection results on RGBTDronePerson. Green boxes denote true positives; red boxes denote false negatives; blue boxes denote false positives. (a) Visible images. (b) Thermal images and detection results of our method. (c) Detection results of QFDet. (d) Detection results of C$^2$Former.
  • Figure 5: Correction results on RGBTDronePerson. Every sub-figure is a zoomed-in area of one image to show the corrected tiny objects clearly. Red boxes denote shifted GTs and green boxes denote corrected GTs. (a)-(h) are eight examples from different images.
  • ...and 3 more figures
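
The Figure 2 caption states that the teacher weights follow the exponential moving average (EMA) of the student weights. A minimal sketch of that generic Mean Teacher update is given below. The function name `ema_update`, the dictionary-of-floats parameter layout, and the momentum value are assumptions made for illustration; the listing here does not give the paper's actual momentum or implementation.

```python
def ema_update(teacher, student, momentum=0.999):
    """Return new teacher weights as an EMA of the student weights.

    teacher, student: dicts mapping parameter names to floats (hypothetical layout).
    momentum: assumed smoothing factor; closer to 1 means a slower-moving teacher.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Toy example: the teacher drifts slowly toward a fixed student.
teacher = {"w": 1.0}
student = {"w": 0.0}
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)
print(teacher["w"])  # 1.0 * 0.9**3, approximately 0.729
```

Because the teacher changes slowly, its corrected boxes are less sensitive to noise in any single student update, which is the usual motivation for generating pseudo-labels with the teacher rather than the student.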