The Solution for the GAIIC2024 RGB-TIR object detection Challenge

Xiangyu Wu, Jinling Xu, Longfei Huang, Yang Yang

TL;DR

This work tackles RGB-TIR object detection for unmanned aerial vehicles under challenging conditions such as complex backgrounds, lighting variations, and uncalibrated RGB-TIR image pairs. It introduces a lightweight YOLOv9 framework augmented with dual backbones, multi-level auxiliary supervision, and a transformer-based feature-level fusion module that fuses RGB and TIR features adaptively. Modality-specific data augmentation and diverse ensemble strategies improve cross-domain robustness, drawing on external datasets such as DroneVehicle and VisDrone. The approach achieves competitive results (mAP 0.516 on A and 0.543 on B) at 26 FPS, demonstrating practical viability for real-time, drone-based RGB-TIR detection in varied urban and rural scenes.

Abstract

This report introduces a solution to the task of RGB-TIR object detection from the perspective of unmanned aerial vehicles. Unlike traditional object detection methods, RGB-TIR object detection uses both RGB and TIR images so that each modality contributes complementary information during detection. The challenges in this setting include highly complex image backgrounds, frequent changes in lighting, and uncalibrated RGB-TIR image pairs. To address these challenges at the model level, we use a lightweight YOLOv9 model with extended multi-level auxiliary branches that enhance the model's robustness, making it more suitable for practical applications in unmanned aerial vehicle scenarios. For image fusion in RGB-TIR detection, we incorporate a fusion module into the backbone network to fuse images at the feature level, implicitly addressing calibration issues. Our proposed method achieved mAP scores of 0.516 and 0.543 on the A and B benchmarks, respectively, while maintaining the highest inference speed among all models.
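
To illustrate why fusing at the feature level can tolerate imperfect RGB-TIR alignment, the following is a minimal sketch of attention-based fusion in plain Python. It is not the authors' fusion module: the function names `softmax` and `cross_attention_fuse` are hypothetical, learned query/key/value projections are omitted, and each feature map is assumed to be flattened into a list of token vectors of equal dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(rgb_tokens, tir_tokens):
    """Add to each RGB token an attention-weighted mix of all TIR tokens.

    Assumption for illustration only: tokens are plain lists of floats and
    no learned projections are applied, unlike a real transformer layer.
    """
    dim = len(rgb_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in rgb_tokens:
        # Similarity of this RGB token to every TIR token, at any position.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in tir_tokens]
        weights = softmax(scores)
        # Weighted average of TIR tokens (the attention "context").
        context = [sum(w * tok[d] for w, tok in zip(weights, tir_tokens))
                   for d in range(dim)]
        # Residual connection keeps the original RGB information.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


# Toy example: the TIR tokens are spatially shifted relative to the RGB tokens.
rgb = [[1.0, 0.0], [0.0, 1.0]]
tir = [[0.0, 1.0], [1.0, 0.0]]
fused = cross_attention_fuse(rgb, tir)
```

Because every RGB token can attend to every TIR token, a matching TIR feature contributes strongly even when it sits at a different spatial position, which is one plausible reading of how feature-level fusion addresses calibration only implicitly.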

Paper Structure

This paper contains 12 sections, 4 figures, and 1 table.

Figures (4)

  • Figure 1: The data comprises various scenes and environments. There are significant differences in the quality and clarity between RGB and TIR images, with notable noise issues. Variations in imaging time and the operational states of different vehicles lead to slight discrepancies in the positions of corresponding vehicles in paired images.
  • Figure 2: Overall Architecture. Our method consists of three main components: extended multi-level auxiliary branches, image feature-level fusion, and data augmentation and model ensemble. We apply different data augmentation techniques to different modal images and perform image fusion at three scales.
  • Figure 3: Multi-level Auxiliary Supervision Branch strategy (taking RGB images as an example).
  • Figure 4: Multi-level Auxiliary Supervision Branch strategy (taking RGB images as an example).