EagleVision: Object-level Attribute Multimodal LLM for Remote Sensing

Hongxiang Jiang, Jihao Yin, Qixiong Wang, Jiaqi Feng, Guo Chen

TL;DR

EagleVision tackles the challenge of dense, object-centered understanding in remote sensing by unifying object detection with fine-grained attribute description through an Attribute Disentangle module and a frozen LLM-based description module. It introduces EVAttrs-95K for instruction tuning and EVBench for evaluation, and demonstrates that disentangled vision tokens coupled with orthogonal subspace learning yield clearer attribute representations. The approach delivers state-of-the-art results on multiple RS datasets for both object detection and object attribute understanding, and ablations confirm the value of token disentanglement and targeted losses. This work advances RS applications by enabling accurate, interpretable per-object reasoning and attribute grounding at scale.

Abstract

Recent advances in multimodal large language models (MLLMs) have demonstrated impressive results in various visual tasks. However, in remote sensing (RS), high image resolution and the small proportion of each image occupied by objects pose challenges to existing MLLMs, which struggle with object-centric tasks, particularly precise localization and fine-grained attribute description for each object. These RS MLLMs have not yet surpassed classical visual perception models, as they provide only coarse image understanding, leading to limited gains in real-world scenarios. To address this gap, we establish EagleVision, an MLLM tailored for remote sensing that excels in object detection and attribute comprehension. Equipped with the Attribute Disentangle module, EagleVision learns disentangled vision tokens to express distinct attributes. To support object-level visual-language alignment, we construct EVAttrs-95K, the first large-scale object attribute understanding dataset in RS for instruction tuning, along with a novel evaluation benchmark, EVBench. EagleVision achieves state-of-the-art performance on both fine-grained object detection and object attribute understanding tasks, highlighting the mutual promotion between detection and understanding capabilities in MLLMs. The code, model, data, and demo will be available at https://github.com/XiangTodayEatsWhat/EagleVision.
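
The summary above mentions disentangled vision tokens and orthogonal subspace learning without giving the loss itself. As a rough illustration only, the sketch below shows one generic way to penalize overlap between per-attribute subspaces: summing the squared Frobenius norm of W_i^T W_j over all pairs of subspace bases. The function names, the 4-dimensional token space, and the one-dimensional bases are assumptions made for this example (the attribute names are borrowed from the Figure 3 caption); this is not the paper's implementation.

```python
from itertools import combinations


def matmul_t(a, b):
    """Return a^T b for two matrices stored as lists of rows (same row count)."""
    rows = len(a)
    return [[sum(a[k][i] * b[k][j] for k in range(rows))
             for j in range(len(b[0]))]
            for i in range(len(a[0]))]


def orthogonality_penalty(bases):
    """Sum of squared Frobenius norms of W_i^T W_j over all pairs i < j.

    The penalty is zero exactly when every pair of subspaces is orthogonal.
    """
    total = 0.0
    for w_i, w_j in combinations(bases, 2):
        cross = matmul_t(w_i, w_j)
        total += sum(v * v for row in cross for v in row)
    return total


# Hypothetical 4-dimensional token space, one basis column per attribute.
hull_color = [[1.0], [0.0], [0.0], [0.0]]
hull_size = [[0.0], [1.0], [0.0], [0.0]]
overlapping = [[1.0], [1.0], [0.0], [0.0]]  # shares a direction with both

print(orthogonality_penalty([hull_color, hull_size]))    # 0.0
print(orthogonality_penalty([hull_color, overlapping]))  # 1.0
```

Under this assumed formulation, driving the penalty toward zero encourages each attribute to occupy its own directions in token space, which is the kind of separation the Figure 3 correlation plot is meant to visualize.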

Paper Structure

This paper contains 22 sections, 7 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: EagleVision for object-level attribute understanding. In contrast to visual perception models (VPMs) and MLLMs, which contribute little to object-level comprehension in remote sensing, EagleVision excels at object attribute understanding, covering various attributes of all detected objects. The prompt used to generate the MLLM results is shown in the Appendix.
  • Figure 2: The overall architecture of EagleVision. EagleVision consists of three main components: Baseline Detector, Attribute Disentangle and Object-level Description, enabling object detection and object attribute understanding tasks.
  • Figure 3: Visualization of the correlation between vision tokens and attributes. The horizontal axis represents different dimensions of vision tokens, and the vertical axis represents their attributes, where sls, hc, hs, ds, da denote ship-load-status, hull-color, hull-size, deck-structure, deck-accessories, respectively.
  • Figure 4: Visualization results on the ShipRSImageNet and FAIR1M datasets. The RTMDet and ground-truth results include detection only, while the GeoChat results are its responses to a predefined prompt. For EagleVision, we highlight the crucial attribute understanding content, which promotes correct detection of the object category.
  • Figure 5: Annotation process diagram.
  • ...and 7 more figures