Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, Mingyu Liu, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, Chunhua Shen

TL;DR

Active-O3 is a reinforcement learning framework that endows Multimodal Large Language Models (MLLMs) with active perception by learning a sensing policy and a task policy under a fixed zoom budget. The paper formalizes MLLM-based active perception, specializes it to a 2D static-scene setting, and uses GRPO to train a two-stage policy with a dual-form reward (heuristic and task-aware). The authors build a comprehensive benchmark across open-world grounding, domain-specific detection, and fine-grained interactive segmentation; results show improved search efficiency and task performance, with strong zero-shot reasoning on the V* benchmark. They also provide a codebase and evaluation protocol to foster future work on active perception for MLLMs.
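The two-stage idea summarized above (a sensing policy proposes zoom-in regions, a task policy then runs on each region, all under a fixed zoom budget) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `active_perception_step`, `propose_regions`, `run_task`, and `clamp_box` are hypothetical, both policies are stand-in callables rather than MLLM calls, and mapping crop-relative boxes back to full-image coordinates is an assumption about how results might be merged.

```python
from typing import Callable, List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def clamp_box(box: Box, width: float, height: float) -> Box:
    """Clip a box so that it lies inside the image bounds."""
    x1, y1, x2, y2 = box
    return (
        max(0, min(x1, width)),
        max(0, min(y1, height)),
        max(0, min(x2, width)),
        max(0, min(y2, height)),
    )


def active_perception_step(
    image_size: Tuple[float, float],
    propose_regions: Callable[[], List[Box]],   # stand-in for the sensing policy
    run_task: Callable[[Box], List[Box]],       # stand-in for the task policy
    budget: int,
) -> Tuple[List[Box], List[Box]]:
    """Run one sensing-then-task pass under a fixed zoom budget."""
    width, height = image_size

    # Stage 1: the sensing policy proposes regions; keep at most `budget`.
    regions = [clamp_box(b, width, height) for b in propose_regions()][:budget]

    # Stage 2: the task policy runs on each zoomed region. Its outputs are
    # assumed to be relative to the crop, so shift them back to image space.
    detections: List[Box] = []
    for rx1, ry1, _rx2, _ry2 in regions:
        region = (rx1, ry1, _rx2, _ry2)
        for x1, y1, x2, y2 in run_task(region):
            detections.append((x1 + rx1, y1 + ry1, x2 + rx1, y2 + ry1))
    return regions, detections
```

In a real system both callables would be prompts to the same or different MLLMs; the sketch only shows how a zoom budget bounds the number of regions the task stage gets to inspect.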

Abstract

Active vision, also known as active perception, refers to the process of actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. Recently, the use of Multimodal Large Language Models (MLLMs) as central planning and decision-making modules in robotic systems has gained extensive attention. However, despite the importance of active perception in embodied intelligence, there is little to no exploration of how MLLMs can be equipped with or learn active perception capabilities. In this paper, we first provide a systematic definition of MLLM-based active perception tasks. We point out that the recently proposed GPT-o3 model's zoom-in search strategy can be regarded as a special case of active perception; however, it still suffers from low search efficiency and inaccurate region selection. To address these issues, we propose ACTIVE-O3, a purely reinforcement learning based training framework built on top of GRPO, designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 across both general open-world tasks, such as small-object and dense object grounding, and domain-specific scenarios, including small object detection in remote sensing and autonomous driving, as well as fine-grained interactive segmentation. In addition, ACTIVE-O3 also demonstrates strong zero-shot reasoning abilities on the V* Benchmark, without relying on any explicit reasoning data. We hope that our work can provide a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.
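Since the framework is built on GRPO, a brief sketch of GRPO's core step may help readers unfamiliar with it: several responses are sampled for the same prompt, each is scored, and every reward is normalized against its own group's mean and standard deviation instead of a learned value baseline. The function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are assumptions made for illustration; the reward values would come from the paper's dual-form reward, which is not reproduced here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the group it was sampled with.

    `rewards` holds one scalar reward per sampled response to the same
    prompt. `eps` (an assumption) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled region proposals for one query, scored by some reward.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that score above the group average receive a positive advantage and are reinforced; those below average are discouraged. When all responses in a group score the same, the advantages are zero and that group contributes no learning signal.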

Paper Structure

This paper contains 59 sections, 14 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Zero-shot reasoning on the $V^*$ benchmark. When asked “Tell me the number on the traffic light”, Qwen2.5 VL incorrectly refers to unrelated text. In contrast, Active-O3 locates and zooms in on the precise area of the traffic light and correctly answers “10” through effective spatial localization.
  • Figure 2: Overview of the proposed Active-O3 framework. Given a multimodal query (e.g., "find all coins"), traditional task models often miss or misidentify target objects due to limited perceptual coverage. Active-O3 enhances perception by allowing the model to actively propose informative subregions (zoom-in regions) based on a learnable sensing policy.
  • Figure 3: Prompt for Active-O3-DET.
  • Figure 4: Visualization details of our proposed method on three datasets.
  • Figure 5: Comparison of segmentation performance (mIoU) under different zoom-in budgets.
  • ...and 11 more figures

Theorems & Definitions (4)

  • Remark D.1: MLLM-Driven Action and Sensing Modules
  • Remark D.2: Optimization Strategy
  • Remark D.3: 2D Setting as a Single-Step Active Perception Problem
  • Remark D.4: GPT-o3 vs. Active-O3