SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories

Muzhi Zhu, Yuzhuo Tian, Hao Chen, Chunluan Zhou, Qingpei Guo, Yang Liu, Ming Yang, Chunhua Shen

TL;DR

This work introduces the Human-Like Mask Annotation Task (HLMAT), a protocol in which MLLMs imitate human annotators via interactive segmentation tools to achieve pixel-level understanding without relying on implicit tokens or external pixel decoders. By modeling segmentation as a multi-step decision process, the authors train SegAgent on trajectories generated from existing segmentation data and enhance it with policy-improvement (StaR+) and PRM-guided tree search to boost robustness in complex scenes. The approach yields competitive results on RES and a newly proposed high-quality HRES dataset, and extends to mask refinement and annotation filtering, highlighting the potential of vision-centered, multi-step decision-making in MLLMs. The work thus provides a unified, text-output-based framework for fine-grained pixel perception with practical implications for open-world segmentation and beyond.

Abstract

While MLLMs have demonstrated adequate image understanding capabilities, they still struggle with pixel-level comprehension, limiting their practical applications. Current evaluation tasks like VQA and visual grounding remain too coarse to assess fine-grained pixel comprehension accurately. Though segmentation is foundational for pixel-level understanding, existing methods often require MLLMs to generate implicit tokens that are decoded through external pixel decoders. This approach disrupts the MLLM's text output space, potentially compromising language capabilities and reducing flexibility and extensibility, while failing to reflect the model's intrinsic pixel-level understanding. Thus, we introduce the Human-Like Mask Annotation Task (HLMAT), a new paradigm where MLLMs mimic human annotators using interactive segmentation tools. Modeling segmentation as a multi-step Markov Decision Process, HLMAT enables MLLMs to iteratively generate text-based click points, achieving high-quality masks without architectural changes or implicit tokens. Through this setup, we develop SegAgent, a model fine-tuned on human-like annotation trajectories, which achieves performance comparable to state-of-the-art (SOTA) methods and supports additional tasks like mask refinement and annotation filtering. HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task that facilitates exploration of MLLMs' visual reasoning abilities. Our adaptations of the policy improvement method StaR and of PRM-guided tree search further enhance model robustness in complex segmentation tasks, laying a foundation for future advancements in fine-grained visual perception and multi-step decision-making for MLLMs.
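
To make the multi-step formulation concrete, the following is a minimal sketch of an HLMAT-style loop, not the authors' implementation. The state is the current mask, each action is a text-encoded click, and a tool maps (mask, click) to the next mask. Everything named here is an assumption for illustration: the action format `positive (x, y)`, the block-snapping `toy_tool` standing in for a real interactive segmentation model, and `oracle_policy`, which reads the ground-truth mask in place of an MLLM (a real model would only see the image and the current mask).

```python
import re

def parse_action(text):
    """Parse a text action like 'positive (3, 4)' into (is_positive, (x, y))."""
    m = re.fullmatch(r"\s*(positive|negative)\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*", text)
    if m is None:
        raise ValueError(f"unparseable action: {text!r}")
    return m.group(1) == "positive", (int(m.group(2)), int(m.group(3)))

def toy_tool(mask, is_positive, point, block=2):
    """Hypothetical stand-in for an interactive segmentation tool.

    Snaps the click to its block x block cell and adds (positive click)
    or removes (negative click) that whole cell.
    """
    x0 = (point[0] // block) * block
    y0 = (point[1] // block) * block
    patch = {(x, y) for x in range(x0, x0 + block) for y in range(y0, y0 + block)}
    return mask | patch if is_positive else mask - patch

def iou(a, b):
    """Intersection over union of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def oracle_policy(mask, target):
    """Hypothetical policy standing in for the MLLM: click the first error pixel."""
    missing = sorted(target - mask)
    if missing:
        x, y = missing[0]
        return f"positive ({x}, {y})"
    extra = sorted(mask - target)
    if extra:
        x, y = extra[0]
        return f"negative ({x}, {y})"
    return None  # nothing left to correct: stop

def annotate(target, policy, max_steps=10):
    """Roll out one trajectory: state = mask, action = text click, transition = tool."""
    mask, trajectory = set(), []
    for _ in range(max_steps):
        action = policy(mask, target)
        if action is None:
            break
        is_positive, point = parse_action(action)
        mask = toy_tool(mask, is_positive, point)
        trajectory.append((action, iou(mask, target)))
    return mask, trajectory

# A 4x4 square target on an 8x8 grid, aligned with the tool's 2x2 cells.
target = {(x, y) for x in range(2, 6) for y in range(2, 6)}
final_mask, trajectory = annotate(target, oracle_policy)
for action, score in trajectory:
    print(action, score)
```

Because every action is plain text, the policy can in principle be any text generator. Only `parse_action` and the tool touch pixel data, which is the separation the task description relies on.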

Paper Structure

This paper contains 35 sections, 8 figures, 6 tables, and 4 algorithms.

Figures (8)

  • Figure 1: The overall framework of SegAgent. The image below shows a complete set of trajectories. We visualize current action $a_t$ and the resulting mask $M_{t+1}$ in one image.
  • Figure 2: An example of a generated trajectory. We visualize current action $a_t$ and the resulting mask $M_{t+1}$ in one image. Due to noise in the GT mask, the actions for iterations 3 and 4 are meaningless.
  • Figure 3: Comparison of dataset complexity.
  • Figure 4: Comparison of different strategies on the HRES dataset.
  • Figure 5: An illustrative example of PRM-guided tree search. The model predicts the reward at each step and selects the action with the highest reward to generate the next mask.
  • ...and 3 more figures
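
The selection rule described in the Figure 5 caption (score each candidate action, take the highest-scoring one, generate the next mask) can be sketched as a greedy, one-step-lookahead search. This is an illustrative sketch under stated assumptions, not the paper's algorithm: `prm_guided_search`, `propose`, `step`, and `reward` are hypothetical names, masks are 1-D pixel sets, and the reward is a toy IoU against a known target where a trained process reward model would instead predict the score without access to ground truth.

```python
def prm_guided_search(state, propose, step, reward, max_steps):
    """At each step, score every candidate action and follow the best one.

    propose(state)        -> iterable of candidate actions
    step(state, action)   -> next state (here: next mask)
    reward(state, action) -> predicted reward of taking `action` in `state`
    """
    path = []
    for _ in range(max_steps):
        candidates = list(propose(state))
        if not candidates:
            break
        scores = [reward(state, action) for action in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)  # first max wins ties
        state = step(state, candidates[best])
        path.append((candidates[best], scores[best]))
    return state, path

# Toy setup (all hypothetical): pixels are integers, a click at p adds {p, p + 1}.
TARGET = frozenset(range(4))

def propose(mask):
    # Three fixed candidate clicks; one of them (5) lies off-target.
    return [p for p in (5, 0, 2) if p not in mask]

def step(mask, p):
    return mask | {p, p + 1}

def reward(mask, p):
    # Stand-in for a learned PRM: IoU of the resulting mask with the target.
    new_mask = step(mask, p)
    return len(new_mask & TARGET) / len(new_mask | TARGET)

final_mask, path = prm_guided_search(frozenset(), propose, step, reward, max_steps=2)
print(final_mask, path)
```

In this toy run the off-target click at 5 is proposed at every step but never selected, which is the intended effect of reward guidance: a single poor sample from the policy does not derail the trajectory.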