EagleVision: Object-level Attribute Multimodal LLM for Remote Sensing
Hongxiang Jiang, Jihao Yin, Qixiong Wang, Jiaqi Feng, Guo Chen
TL;DR
EagleVision tackles the challenge of dense, object-centered understanding in remote sensing (RS) by unifying object detection with fine-grained attribute description through an Attribute Disentangle module and a frozen LLM-based description module. It introduces EVAttrs-95K for instruction tuning and EVBench for evaluation, and demonstrates that disentangled vision tokens coupled with orthogonal subspace learning yield clearer attribute representations. The approach delivers state-of-the-art results on multiple RS datasets for both object detection and object attribute understanding, and ablations confirm the value of token disentanglement and targeted losses. This work advances RS applications by enabling accurate, interpretable per-object reasoning and attribute grounding at scale.
Abstract
Recent advances in multimodal large language models (MLLMs) have demonstrated impressive results in various visual tasks. However, in remote sensing (RS), the high image resolution and the small proportion of each image occupied by objects pose challenges to existing MLLMs, which struggle with object-centric tasks, particularly precise localization and fine-grained attribute description for each object. Existing RS MLLMs have not yet surpassed classical visual perception models, as they provide only coarse image understanding, leading to limited gains in real-world scenarios. To address this gap, we establish EagleVision, an MLLM tailored for remote sensing that excels in object detection and attribute comprehension. Equipped with the Attribute Disentangle module, EagleVision learns disentangled vision tokens to express distinct attributes. To support object-level visual-language alignment, we construct EVAttrs-95K, the first large-scale object attribute understanding dataset in RS for instruction tuning, along with a novel evaluation benchmark, EVBench. EagleVision achieves state-of-the-art performance on both fine-grained object detection and object attribute understanding tasks, highlighting the mutual promotion between detection and understanding capabilities in MLLMs. The code, model, data, and demo will be available at https://github.com/XiangTodayEatsWhat/EagleVision.
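
The TL;DR refers to orthogonal subspace learning over disentangled vision tokens, but this summary does not state the loss EagleVision actually uses. As a rough illustration of the general idea only, the sketch below computes a generic soft-orthogonality penalty: it is zero when the token vectors are mutually orthogonal and grows as they align. The function name `orthogonality_penalty` and the token shapes are assumptions made for this example, not taken from the paper or its code.

```python
import numpy as np


def orthogonality_penalty(tokens: np.ndarray) -> float:
    """Generic soft-orthogonality penalty (illustrative, not the paper's loss).

    tokens: array of shape (num_tokens, dim), one row per attribute token.
    Returns the sum of squared off-diagonal cosine similarities.
    """
    # Normalize each token to unit length so the Gram matrix holds cosines.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)

    # Pairwise cosine similarities between all tokens.
    gram = unit @ unit.T

    # Remove the diagonal (self-similarity) and penalize what remains.
    off_diag = gram - np.diag(np.diag(gram))
    return float(np.sum(off_diag ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    random_tokens = rng.normal(size=(4, 16))  # hypothetical: 4 tokens, 16 dims
    print("orthogonal tokens:", orthogonality_penalty(np.eye(4)))
    print("random tokens:    ", orthogonality_penalty(random_tokens))
```

In a training setup, a term of this kind would typically be added to the task loss with a small weight so that each token is pushed toward its own subspace; whether EagleVision does this in the same way is not stated here.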
