A Simple Aerial Detection Baseline of Multimodal Language Models

Qingyun Li; Yushi Chen; Xinya Shu; Dong Chen; Xin He; Yi Yu; Xue Yang

TL;DR

The paper addresses the challenge of applying multimodal language models to aerial, multi-class detection in remote sensing by introducing LMMRotate, a simple baseline that normalizes numerical detection outputs into text for MLM processing. It outlines a bimodal framework to fuse image features with textual prompts, and proposes a fair evaluation protocol using $mAP_{nc}$ and $mF_1$ to compare MLMs with conventional detectors. Empirical results show that fine-tuned MLMs can match or exceed traditional detectors across optical and SAR benchmarks, with joint training further boosting performance, especially on smaller datasets. This work provides a practical baseline and evaluation framework to extend RS MLM capabilities toward more comprehensive, multi-dataset aerial detection, contributing to broader RS-MLM and AGI development.

Abstract

Multimodal language models (MLMs) based on generative pre-trained Transformers are considered powerful candidates for unifying various domains and tasks. MLMs developed for remote sensing (RS) have demonstrated outstanding performance in multiple tasks, such as visual question answering and visual grounding. Visual grounding detects the specific objects that correspond to a given instruction. Aerial detection, which detects all objects of multiple categories, is also a valuable and challenging task for RS foundation models. However, aerial detection has not been explored by existing RS MLMs because the autoregressive prediction mechanism of MLMs differs significantly from the detection outputs. In this paper, we present LMMRotate, the first simple baseline for applying MLMs to aerial detection. Specifically, we first introduce a normalization method that transforms detection outputs into textual outputs compatible with the MLM framework. Then, we propose an evaluation method that ensures a fair comparison between MLMs and conventional object detection models. We construct the baseline by fine-tuning open-source general-purpose MLMs and achieve impressive detection performance comparable to that of conventional detectors. We hope that this baseline will serve as a reference for future MLM development, enabling more comprehensive capabilities for understanding RS images. Code is available at https://github.com/Li-Qingyun/mllm-mmrotate.
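
As a rough illustration of the normalization step described in the abstract, the Python sketch below turns category names and 8-parameter polygon boxes into a single text response. The bin count of 1000, the separator characters, and the function names are assumptions made for this example; they are not taken from the paper or its code release.

```python
from typing import List, Sequence, Tuple

NUM_BINS = 1000  # assumed quantization resolution, not a value from the paper


def normalize_polygon(poly: Sequence[float], width: int, height: int,
                      num_bins: int = NUM_BINS) -> List[int]:
    """Map 8 pixel coordinates (x1, y1, ..., x4, y4) to integers in [0, num_bins - 1]."""
    if len(poly) != 8:
        raise ValueError("expected 8 polygon parameters")
    out = []
    for i, value in enumerate(poly):
        size = width if i % 2 == 0 else height   # even index: x, odd index: y
        frac = min(max(value / size, 0.0), 1.0)  # clip to the image extent
        out.append(min(int(frac * num_bins), num_bins - 1))
    return out


def detections_to_text(objects: Sequence[Tuple[str, Sequence[float]]],
                       width: int, height: int) -> str:
    """Serialize (category, polygon) pairs as one response string (hypothetical format)."""
    parts = []
    for name, poly in objects:
        coords = " ".join(str(c) for c in normalize_polygon(poly, width, height))
        parts.append(f"{name} {coords}")
    return "; ".join(parts)


if __name__ == "__main__":
    sample = [("plane", [0, 0, 400, 0, 400, 800, 0, 800])]
    print(detections_to_text(sample, width=800, height=800))
```

Quantizing to a fixed integer range keeps the target text independent of image resolution, which is one plausible reason to normalize before tokenization; the actual scheme used by LMMRotate may differ.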

Paper Structure (11 sections, 3 equations, 4 figures, 1 table)

Introduction
Method
Preliminary of Multimodal Language Models
Normalization of Detection Outputs
Evaluation of MLM detectors
Experiment
Benchmark Datasets
Evaluation Settings
Implementation Details
Comparison Results
Conclusion

Figures (4)

Figure 1: Visualization of the objects detected by our MLM detector based on Florence-2-large in the single-dataset setting. The images are selected from the test sets of DOTA-v1.0 and RSAR.
Figure 2: The overall framework of the proposed MLM detector baseline.
Figure 3: An example of an RS image and its response, which contains the category names and 8-parameter polygon boxes of the objects.
Figure 4: The impact of confidence scores on mAP / $\text{mAP}_{\text{nc}}$ with error bands. The colored lines show how $\text{mAP}_{\text{nc}}$ varies with the confidence threshold for two popular conventional detectors on DOTA-v1.0 (trained on the 'train' split, evaluated on the 'validation' split) and DIOR-R (trained on the 'trainval' split, evaluated on the 'test' split, with an input size of $800 \times 800$).
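
The evaluation idea behind Figure 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: it reads the "nc" subscript as "no confidence" (detections are thresholded and their scores discarded before scoring) and treats $mF_1$ as the unweighted mean of per-class F1. Neither definition is spelled out in this section, and the function names and data layout are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Detection = Tuple[str, Sequence[float], float]  # (category, polygon, score)


def strip_confidence(detections: List[Detection],
                     threshold: float) -> List[Tuple[str, Sequence[float]]]:
    """Keep detections scoring at least `threshold`, then drop the score field."""
    return [(label, box) for label, box, score in detections if score >= threshold]


def mean_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Average per-class F1 given class -> (true pos., false pos., false neg.)."""
    scores = []
    for tp, fp, fn in counts.values():
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when a class has no counts
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    dets = [("ship", (0,) * 8, 0.9), ("ship", (1,) * 8, 0.2)]
    print(strip_confidence(dets, threshold=0.5))
    print(mean_f1({"ship": (8, 2, 2), "plane": (5, 5, 0)}))
```

Removing the scores from a conventional detector's output puts it on the same footing as an MLM detector, whose text response carries no confidence values.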
