Language-guided Learning for Object Detection Tackling Multiple Variations in Aerial Images

Sungjune Park, Hyunjun Kim, Beomchan Park, Yong Man Ro

TL;DR

The paper tackles the challenge of robust object detection in aerial images plagued by both scene-level (weather, illumination) and instance-level (viewpoint, scale) variations. It proposes LANGO, a language-guided learning framework that integrates a Visual Semantic Reasoner to capture scene semantics and a Relation Learning Loss to enforce language-like robustness to appearance changes, embedded in a transformer-based detector. Through training-time use of a large language model and scene-context language prompts, LANGO achieves state-of-the-art results on UAVDT and VisDrone with minimal inference overhead. The work demonstrates that incorporating language representations and context during training can significantly enhance robustness to diverse variations, with practical implications for safety, surveillance, and environmental monitoring.

Abstract

Despite recent advancements in computer vision research, object detection in aerial images still suffers from several challenges. One primary challenge to be mitigated is the presence of multiple types of variation in aerial images, for example, illumination and viewpoint changes. These variations result in highly diverse image scenes and drastic alterations in object appearance, so that it becomes more complicated to localize objects from the whole image scene and recognize their categories. To address this problem, in this paper, we introduce a novel object detection framework in aerial images, named LANGuage-guided Object detection (LANGO). Upon the proposed language-guided learning, the proposed framework is designed to alleviate the impacts from both scene and instance-level variations. First, we are motivated by the way humans understand the semantics of scenes while perceiving environmental factors in the scenes (e.g., weather). Therefore, we design a visual semantic reasoner that comprehends visual semantics of image scenes by interpreting conditions where the given images were captured. Second, we devise a training objective, named relation learning loss, to deal with instance-level variations, such as viewpoint angle and scale changes. This training objective aims to learn relations in language representations of object categories, with the help of the robust characteristics against such variations. Through extensive experiments, we demonstrate the effectiveness of the proposed method, and our method obtains noticeable detection performance improvements.
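One way to picture the relation learning loss described above is the following NumPy sketch. It is an illustrative reading rather than the paper's formulation: the cosine similarity, the softmax temperature `tau`, the KL-divergence matching, and the names `relation_vectors` and `relation_loss` are all assumptions, and the language embeddings are random stand-ins for representations that would come from a language model.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relation_vectors(feats, anchors, tau=0.1):
    # Each row is a distribution over anchors built from cosine similarities.
    sims = l2_normalize(feats) @ l2_normalize(anchors).T
    return softmax(sims / tau, axis=-1)

def relation_loss(visual_feats, labels, lang_embeds, tau=0.1, eps=1e-12):
    # Target: how each object's category relates to all categories in language space.
    lang_rel = relation_vectors(lang_embeds, lang_embeds, tau)   # (C, C)
    target = lang_rel[labels]                                    # (M, C)
    # Prediction: how each visual object feature relates to the same categories.
    pred = relation_vectors(visual_feats, lang_embeds, tau)      # (M, C)
    # Assumed matching criterion: KL(target || pred), averaged over objects.
    kl = np.sum(target * (np.log(target + eps) - np.log(pred + eps)), axis=-1)
    return kl.mean()

rng = np.random.default_rng(0)
num_categories, dim = 5, 8
lang_embeds = rng.normal(size=(num_categories, dim))  # stand-in category embeddings
labels = np.array([0, 2, 2, 4])                       # category of each detected object
visual_feats = lang_embeds[labels] + 0.1 * rng.normal(size=(labels.size, dim))
loss = relation_loss(visual_feats, labels, lang_embeds)
print(float(loss))
```

In this sketch the loss approaches zero when a visual object feature relates to the categories in the same way as its category's language representation does, which is the intuition behind borrowing the robustness of language representations.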

Paper Structure

This paper contains 24 sections, 3 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Examples of the multiple variations present in aerial images. (a) shows scene-level variations (e.g., weather and illumination); such environmental factors change the appearance of the entire scene, so the visual semantics of the given examples differ from one another. The first image is taken in clear weather during the daytime, so both object instances and backgrounds are clearly visible and distinguishable. By contrast, the third image is captured in rainy weather at nighttime, so the scene is dim and objects are less visible. (b) shows that instance-level variations, such as changes in viewpoint angle and scale, make object instances look different even within the same category. For example, person instances look very different depending on their viewpoint angles and scales: even though they are wearing white shirts and dark pants, their appearances vary. Likewise, the two black vehicles look different because of their scale and viewpoint.
  • Figure 2: The overall architecture of the proposed method. While the object detection framework takes an input aerial image and extracts image features, we incorporate a visual semantic reasoner to comprehend the visual semantics of the given image and adapt to diverse scene-level variations (e.g., weather and illumination). Moreover, when the detection framework is trained to predict object categories from visual object features, we add a relation learning loss designed so that the visual object features resemble the relations among language representations of object instances, which are robust against instance-level variations.
  • Figure 3: Details of the visual semantic reasoner, which consists of vision-to-language (V2L) cross attention. It takes the image features and a scene context prompt and uses them as the query and key/value features, respectively.
  • Figure 4: (a) shows t-SNE feature visualization results demonstrating that language instance representations cluster by object category rather than being scattered by instance-level variation. The object categories include those from the UAVDT dataset and other categories likely to appear in aerial images. (b) explains the process of deriving a similarity vector for the $i$-th category from instance representations in the language domain. The resulting relation vector is used in the relation loss to guide the visual object features to learn the robust relations of language instance representations.
  • Figure 5: The t-SNE visualization of language instance representations for object categories in UAVDT and VisDrone. Starting from language representations that are already distinguishable across categories, the learnable categorical prompts help make them more distinct and more robust against instance-level variations.
  • ...and 1 more figure
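
The V2L cross attention described in the Figure 3 caption can be sketched as follows. This is a minimal single-head NumPy sketch under stated assumptions, not the authors' implementation: the function name `v2l_cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the feature sizes, and the residual connection are hypothetical. Only the roles come from the caption: image features act as queries, and scene context prompt features act as keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def v2l_cross_attention(image_feats, prompt_feats, w_q, w_k, w_v):
    """Single-head cross attention in which image tokens attend to prompt tokens.

    image_feats:  (N, d) image features, used as queries
    prompt_feats: (T, d) scene context prompt features, used as keys and values
    w_q, w_k, w_v: (d, d) projection matrices (hypothetical parameters)
    Returns the updated image features (N, d) and the attention weights (N, T).
    """
    q = image_feats @ w_q
    k = prompt_feats @ w_k
    v = prompt_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores
    attn = softmax(scores, axis=-1)           # each image token distributes weight over prompt tokens
    out = image_feats + attn @ v              # residual connection is an assumption
    return out, attn

rng = np.random.default_rng(0)
d, n_img, n_txt = 16, 6, 4
image_feats = rng.normal(size=(n_img, d))     # stand-in for backbone image features
prompt_feats = rng.normal(size=(n_txt, d))    # stand-in for encoded prompt tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
out, attn = v2l_cross_attention(image_feats, prompt_feats, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

Because the prompt features enter only as keys and values, the output keeps the spatial layout of the image features, so a downstream detector could consume it without structural changes.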