LGI-DETR: Local-Global Interaction for UAV Object Detection

Zifa Chen

TL;DR

LGI-DETR tackles UAV object detection by introducing cross-layer local-global feature interaction to address small-object and scale-variation challenges. Built on RT-DETR, it adds Local Spatial Enhancement (LSE) and Global Information Injection (GII) to fuse low-level spatial detail with high-level semantic context. On VisDrone2019-DET and UAVDT, it achieves higher AP and AP50 with only modest increases in parameters and FLOPs compared to baselines. The results demonstrate the effectiveness of bidirectional feature fusion for robust UAV imagery detection and improved small-object localization.

Abstract

UAVs are widely used in many fields. However, most existing object detectors deployed on drones are not end-to-end and require complex hand-designed components and careful fine-tuning. Most existing end-to-end object detectors are designed for natural scenes, and applying them directly to UAV images is not ideal. To address these challenges, we design a local-global information interaction DETR for UAVs, namely LGI-DETR. It performs cross-layer bidirectional enhancement between low-level and high-level features, a fusion method that is especially effective for small object detection. At the initial stage of the encoder, we propose a local spatial enhancement module (LSE), which injects the rich local spatial information of low-level features into the high-level features and reduces the loss of local information as high-level information is propagated. At the final stage of the encoder, we propose a novel global information injection module (GII) designed to integrate rich high-level global semantic representations with low-level feature maps. This hierarchical fusion mechanism effectively addresses the inherent limitations of local receptive fields by propagating contextual information across the feature hierarchy. Experimental results on two challenging UAV image object detection benchmarks, VisDrone2019 and UAVDT, show that our proposed model outperforms the SOTA model. Compared to the baseline model, AP and AP50 improve by 1.9% and 2.4%, respectively.
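
The AP50 figure reported above counts a detection as correct when its box overlaps a ground-truth box with an intersection-over-union (IoU) of at least 0.5. This assumes the common COCO-style convention, which the abstract does not spell out. The sketch below is illustrative only and is not taken from the paper; the function name `box_iou` and the example boxes are hypothetical. It shows why small objects are harder under a fixed IoU threshold: the same 3-pixel localization error costs a 10x10 box far more overlap than a 100x100 box.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical boxes: the prediction is shifted 3 pixels to the right of the ground truth.
small_iou = box_iou((0.0, 0.0, 10.0, 10.0), (3.0, 0.0, 13.0, 10.0))
large_iou = box_iou((0.0, 0.0, 100.0, 100.0), (3.0, 0.0, 103.0, 100.0))

print(f"small object IoU: {small_iou:.3f}")
print(f"large object IoU: {large_iou:.3f}")
```

Under these assumptions the small box only just clears the 0.5 threshold while the large box stays well above it, which is one reason small-object localization dominates UAV benchmarks.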

Paper Structure

This paper contains 18 sections, 4 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: (a) Heatmap of the baseline. (b) Heatmap of LGI-DETR. Brighter areas indicate stronger attention by the model. Our model attends to objects more strongly than the baseline model.
  • Figure 2: Overview of LGI-DETR. First, the backbone extracts multi-scale features from the image. These features are then fed into the encoder for feature representation. Finally, the detection head outputs the detection results for the generated object queries. LSE denotes the Local Spatial Enhancement module; GII denotes the Global Information Injection module; AIFI denotes Attention-based Intra-scale Feature Interaction.
  • Figure 3: Overview of LSE, which comprises two branches. The first branch extracts low-level feature weights, and the second branch fuses the extracted weights into the high-level features. The low-level feature weights derived through the patch merge operation are subsequently integrated with selected high-level channel features.
  • Figure 4: Overview of GII, which comprises two branches. The first branch obtains local spatial weights and global semantic information, and the second branch fuses the extracted information into the low-level features.
  • Figure 5: Visualization comparing the detection performance of LGI-DETR and the baseline in different scenarios. The first row shows the ground truth, the second row shows the baseline model's detections, and the third row shows LGI-DETR's detections.
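
The Figure 3 caption mentions a patch merge operation whose output weights a subset of high-level channels. The following is a minimal pure-Python sketch of that general idea and not the paper's implementation. It assumes patch merging means a 2x2 space-to-depth rearrangement, as in Swin-style patch merging without the learned projection. The sigmoid-of-mean weighting and the names `patch_merge`, `gate_high_level`, and `num_gated` are hypothetical choices made only for illustration.

```python
import math


def patch_merge(feature, patch=2):
    """Fold each patch x patch neighbourhood of an [H][W][C] nested list into channels.

    Returns an [H/patch][W/patch][C*patch*patch] nested list (space-to-depth).
    """
    height, width = len(feature), len(feature[0])
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by patch")
    merged = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            cell = []
            for dy in range(patch):
                for dx in range(patch):
                    cell.extend(feature[top + dy][left + dx])
            row.append(cell)
        merged.append(row)
    return merged


def gate_high_level(high, low_merged, num_gated):
    """Scale the first num_gated channels of high by a per-location weight.

    Assumption: the weight is sigmoid(mean of the merged low-level channels).
    The remaining high-level channels pass through unchanged.
    """
    fused_map = []
    for y, row in enumerate(high):
        fused_row = []
        for x, channels in enumerate(row):
            cell = low_merged[y][x]
            weight = 1.0 / (1.0 + math.exp(-sum(cell) / len(cell)))
            gated = [value * weight for value in channels[:num_gated]]
            fused_row.append(gated + list(channels[num_gated:]))
        fused_map.append(fused_row)
    return fused_map


# Toy inputs: a 4x4 single-channel low-level map and a 2x2 two-channel high-level map.
low = [[[float(4 * r + c)] for c in range(4)] for r in range(4)]
high = [[[1.0, 1.0] for _ in range(2)] for _ in range(2)]

merged = patch_merge(low)                          # 2x2 map with 4 channels per location
fused = gate_high_level(high, merged, num_gated=1)  # only channel 0 is re-weighted
print(merged[0][0])
print(fused[0][0])
```

After merging, the low-level map has the same spatial size as the high-level map, so a per-location weight can be applied without interpolation. A real module would typically use learned convolutions in place of the fixed mean-and-sigmoid weighting assumed here.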