MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing

Karim Radouane, Hanane Azzag, Mustapha Lebbah

TL;DR

MB-ORES addresses visual grounding in remote sensing by unifying open-set object detection with referring expression comprehension (REC). It uses a two-stage architecture. The first stage fine-tunes GroundingDINO on REC data to produce graph-structured object proposals. The second stage applies a three-branch cross-modal network and an object reasoner with soft query selection to localize the referred object, followed by a regression head that predicts the final bounding box. The approach achieves state-of-the-art results on OPT-RSVG and DIOR-RSVG while retaining OD capabilities, and ablations support the contribution of both multi-branch fusion and reasoning. The work advances RS VG by combining explicit object priors with cross-modal reasoning, with practical implications for zero-shot reasoning and RS scene understanding.

Abstract

We propose a unified framework that integrates object detection (OD) and visual grounding (VG) for remote sensing (RS) imagery. To support conventional OD and establish an intuitive prior for the VG task, we fine-tune an open-set object detector using referring expression data, framing it as a partially supervised OD task. In the first stage, we construct a graph representation of each image, comprising object queries, class embeddings, and proposal locations. Then, our task-aware architecture processes this graph to perform the VG task. The model consists of: (i) a multi-branch network that integrates spatial, visual, and categorical features to generate task-aware proposals, and (ii) an object reasoning network that assigns probabilities across proposals, followed by a soft selection mechanism for final referring object localization. Our model demonstrates superior performance on the OPT-RSVG and DIOR-RSVG datasets, achieving significant improvements over state-of-the-art methods while retaining classical OD capabilities. The code will be available in our repository: https://github.com/rd20karim/MB-ORES.
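The second-stage idea described in the abstract (fuse per-proposal features from three branches, assign a probability to each proposal, then softly select a final representation) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `fuse_branches` and `soft_select`, the element-wise sum used for fusion, the dot-product scoring against a text embedding, and the toy 2-dimensional features are all assumptions made for illustration. In the paper the branches and the reasoner are learned networks, and a regression head produces the final box.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_branches(visual, category, spatial):
    # Assumption: fusion is an element-wise sum of three same-sized
    # branch outputs. The paper's learned fusion may differ.
    return [v + c + s for v, c, s in zip(visual, category, spatial)]


def soft_select(proposals, text_embedding):
    # Assumption: each proposal is scored by a dot product with the
    # expression embedding, standing in for the learned object reasoner.
    scores = [dot(p, text_embedding) for p in proposals]
    probs = softmax(scores)
    dim = len(proposals[0])
    # Soft selection: probability-weighted sum of proposal features,
    # instead of a hard argmax over proposals.
    selected = [sum(w * p[i] for w, p in zip(probs, proposals)) for i in range(dim)]
    return probs, selected


# Toy graph nodes: one entry per detector proposal (hypothetical values).
nodes = [
    {"visual": [1.0, 0.0], "category": [0.5, 0.0], "spatial": [0.0, 0.1]},
    {"visual": [0.0, 1.0], "category": [0.0, 0.5], "spatial": [0.1, 0.0]},
    {"visual": [0.2, 0.2], "category": [0.1, 0.1], "spatial": [0.0, 0.0]},
]
text_embedding = [0.0, 4.0]  # hypothetical embedding of the referring expression

fused = [fuse_branches(n["visual"], n["category"], n["spatial"]) for n in nodes]
probs, selected = soft_select(fused, text_embedding)
best = max(range(len(probs)), key=probs.__getitem__)
print("most probable proposal:", best)
```

Because the selection is a weighted sum and not an argmax, it stays differentiable, which is the usual motivation for soft selection when the downstream head is trained jointly.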

Paper Structure

This paper contains 21 sections, 16 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Unlike previous approaches, our framework is designed to retain object detection capabilities while providing users with essential information to simplify query formulation for their object of interest.
  • Figure 2: Our Overall Framework (MB-ORES): In the first stage, the object detector is trained on partially annotated images from the REC data, producing output structured as a graph. In the second stage, these outputs are processed through a multi-branch network, fused into task-aware object proposals, and refined using reasoning and selection modules to generate the final representation for referred object localization.
  • Figure 3: DIOR-RSVG: At the top of the image, the results for the REC task are shown (prediction in red), while at the bottom, the OD task is performed simultaneously using our unified approach.
  • Figure 4: DIOR-RSVG: Visual Grounding of multiple referring expressions per image.
  • Figure 5: OPT-RSVG: Visual Grounding of multiple referring expressions per image (ground-truth in dashed gray color).
  • ...and 4 more figures