DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes

Sungjune Park, Hyunjun Kim, Junho Kim, Seongho Kim, Yong Man Ro

TL;DR

Multimodal large language models still struggle with fine-grained instance perception in crowded real-world scenes. DIP-R1 introduces a reinforcement-learning framework built on GRPO that guides MLLMs to reason, inspect uncertain regions, and make accurate instance predictions via three rule-based rewards: a Think-Look-Answer format reward, a variance-guided look reward, and a weighted precision-recall accuracy reward. The approach yields consistent, significant gains over baselines and SFT across four real-world datasets (CrowdHuman, CityPersons, WiderPedestrian, UAVDT) and demonstrates improved generalization to out-of-domain scenes. This work demonstrates that incorporating structured RL signals into MLLMs can substantially boost fine-grained visual perception with practical implications for crowd safety and surveillance tasks.
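To illustrate how a GRPO-style setup could turn rule-based rewards into a learning signal, the sketch below normalizes per-response rewards within one sampled group. The function name, the example reward totals, the use of the population standard deviation, and the epsilon term are assumptions for illustration; none of them are taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses (GRPO-style).

    Each response's advantage is its reward minus the group mean,
    divided by the group standard deviation (plus eps for stability).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical totals (e.g. format + look + accuracy reward) for four
# responses sampled for the same image and prompt.
totals = [2.0, 1.0, 0.0, 1.0]
advantages = group_relative_advantages(totals)
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so no separately learned value model is needed to rank them.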

Abstract

MLLMs have demonstrated significant visual understanding capabilities, yet their fine-grained visual perception in complex real-world scenarios, such as densely crowded public areas, remains limited. Inspired by the recent success of RL in both LLMs and MLLMs, we explore how RL can enhance the visual perception ability of MLLMs. We then develop a novel RL-based framework, Deep Inspection and Perception with RL (DIP-R1), designed to enhance the visual perception capabilities of MLLMs by comprehending complex scenes and looking closely through visual instances. DIP-R1 guides MLLMs through a detailed inspection of the visual scene via three simple rule-based rewards. First, we adopt a standard reasoning reward that encourages the model to follow a three-step reasoning process: 1) comprehending the entire visual scene, 2) observing, i.e., looking through regions of interest that remain ambiguous, and 3) decision-making to predict the answer. Second, a variance-guided looking reward encourages the MLLM to examine uncertain regions during the observing process, guiding it to inspect ambiguous areas and mitigate perceptual uncertainty. This reward promotes variance-driven visual exploration, enabling the MLLM to reason about region-level uncertainty and explicitly indicate interpretable uncertain regions. Third, we model a weighted precision-recall accuracy reward that promotes accurate decision-making. We verify the effectiveness of DIP-R1 on diverse fine-grained object detection data consisting of challenging real-world scenes, such as densely crowded scenes. Built upon existing MLLMs, DIP-R1 achieves consistent and significant improvements across various in-domain and out-of-domain scenarios, outperforming existing baselines and the SFT method. Our findings highlight the substantial potential of integrating RL into MLLMs for enhancing capabilities in complex real-world perception tasks.
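The abstract does not spell out the accuracy reward, so the following is only a sketch of one plausible form: predicted boxes are greedily matched to ground-truth boxes at an IoU threshold, and precision and recall are then mixed with a weight. The threshold of 0.5, the weight `alpha`, the greedy matching rule, the handling of empty inputs, and all function names are assumptions for illustration and may differ from the paper's definition.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def weighted_pr_reward(pred, gt, alpha=0.5, thr=0.5):
    """Return alpha * precision + (1 - alpha) * recall (hypothetical form)."""
    if not pred or not gt:
        # Assumption: full reward only when both lists are empty.
        return 1.0 if not pred and not gt else 0.0
    unmatched = list(range(len(gt)))
    true_positives = 0
    for p in pred:
        best_j, best_iou = None, thr
        for j in unmatched:
            v = iou(p, gt[j])
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j is not None:
            unmatched.remove(best_j)  # each ground-truth box matches once
            true_positives += 1
    precision = true_positives / len(pred)
    recall = true_positives / len(gt)
    return alpha * precision + (1 - alpha) * recall


gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(0, 0, 10, 10), (50, 50, 60, 60), (21, 20, 31, 30)]
reward = weighted_pr_reward(pred_boxes, gt_boxes)
```

In this example two of the three predictions match a ground-truth box, so precision is 2/3 and recall is 1. Lowering `alpha` shifts the reward toward recall, which matters in crowded scenes where missed people are common; raising it penalizes spurious boxes more heavily.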

Paper Structure

This paper contains 21 sections, 10 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: (a) shows public scene examples. Qwen2.5-VL (3B) properly separates each person when the people are relatively large and overlap little. However, in complex scenes it fails to separate individuals and treats them as a single entity. (b) shows that the proposed DIP-R1 obtains large improvements over existing methods.
  • Figure 2: Overall framework of the proposed DIP-R1. Given an input sample, both the reference and policy models generate responses for KL regularization. DIP-R1 computes rewards based on three reward functions: 1) a Think-Look-Answer format reward that requires the output to follow <think> reasoning </think><look> observing </look><answer> decision </answer>, 2) a variance-guided look reward to examine uncertain regions in the observing process, and 3) a weighted precision-recall reward to obtain an accurate answer by considering both precision and recall.
  • Figure 3: The visualization results of Qwen2.5-VL-Instruct (3B), SFT on Qwen2.5-VL-Instruct (3B), DeepSeek-VL2-Small (16B), and DIP-R1 (3B), along with ground-truth regions. As shown in the figure, the existing baseline models often fail to perceive each individual, tending to miss people or treat several of them as a single entity. In contrast, our method separates and recognizes each of them properly.
  • Figure 4: Output analysis of DIP-R1 framework, showing the reasoning description, the uncertain regions (green boxes) captured in the observing process, and the prediction results (orange boxes).
  • Figure 5: Qualitative comparison between the SFT baseline and DIP-R1, both built upon Qwen2.5-VL-Instruct (3B). As shown in (a), SFT usually produces noisy predictions, which may lead to high recall but low precision. In contrast, as shown in (b), DIP-R1 produces more reliable predictions.
  • ...and 3 more figures
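
The Think-Look-Answer format described in the Figure 2 caption lends itself to a simple rule-based check. The sketch below assumes a binary reward (1.0 for a well-formed response, 0.0 otherwise) and assumes each tag must appear exactly once and in order; the function name and these scoring choices are illustrative assumptions, not the paper's implementation.

```python
import re

# The three sections must appear in this order, with nothing but
# whitespace before, between, or after them.
_TLA_PATTERN = re.compile(
    r"\s*<think>.*?</think>\s*<look>.*?</look>\s*<answer>.*?</answer>\s*",
    re.DOTALL,
)
_TAGS = ("<think>", "</think>", "<look>", "</look>", "<answer>", "</answer>")


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows think -> look -> answer, else 0.0."""
    # Assumption: each opening and closing tag occurs exactly once.
    if any(response.count(tag) != 1 for tag in _TAGS):
        return 0.0
    return 1.0 if _TLA_PATTERN.fullmatch(response) else 0.0


good = "<think>A crowded street.</think><look>[12, 40, 88, 120]</look><answer>[]</answer>"
missing_look = "<think>A crowded street.</think><answer>[]</answer>"
wrong_order = "<look>x</look><think>y</think><answer>z</answer>"
```

Here `format_reward(good)` is 1.0, while the other two strings score 0.0 because one omits the look section and the other puts the sections out of order.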