Table of Contents
Fetching ...

Pixel-Level Reasoning Segmentation via Multi-turn Conversations

Dexian Cai, Xiaocui Yang, Yongkang Liu, Daling Wang, Shi Feng, Yifei Zhang, Soujanya Poria

TL;DR

The paper defines Pixel-level Reasoning Segmentation (Pixel-level RS) and introduces a multi-turn dialogue framework to achieve pixel-precise segmentation. It presents PRIST, a dataset with 24k utterances and 8.3k multi-turn conversations for fine-grained targets, generated via a three-stage reasoning pipeline inspired by Tree-of-Thought. The MIRAS framework integrates dual visual encoders, semantic region alignment, and a segmentation prompt to enable pixel-grounded explanations and segmentation under progressive reasoning. It also establishes comprehensive evaluation metrics, including LLM-based reasoning scores, and demonstrates that MIRAS surpasses relevant baselines in both segmentation accuracy and reasoning quality, approaching human performance. Together, PRIST and MIRAS advance the field of pixel-level reasoning segmentation by combining rich conversational reasoning with precise grounding at the pixel level, enabling more faithful and interpretable visual understanding in interactive systems.

Abstract

Existing visual perception systems focus on region-level segmentation in single-turn dialogues, relying on complex and explicit query instructions. Such systems cannot reason at the pixel level and comprehend dynamic user intent that changes over interaction. Our work tackles this issue by introducing a novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on multi-turn conversations, tracking evolving user intent via multi-turn interactions for fine-grained segmentation. To establish a benchmark for this novel task, we build a Pixel-level ReasonIng Segmentation Dataset Based on Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k multi-turn conversational scenarios with segmentation targets. Building on PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning Segmentation framework, integrates pixel-level segmentation with robust multi-turn conversation understanding, generating pixel-grounded explanations aligned with user intent. The PRIST dataset and MIRSA framework fill the gap in pixel-level reasoning segmentation. Experimental results on the PRIST dataset demonstrate that our method outperforms current segmentation-specific baselines in terms of segmentation and LLM-based reasoning metrics. The code and data are available at: https://github.com/ccccai239/PixelRIST.

Pixel-Level Reasoning Segmentation via Multi-turn Conversations

TL;DR

The paper defines Pixel-level Reasoning Segmentation (Pixel-level RS) and introduces a multi-turn dialogue framework to achieve pixel-precise segmentation. It presents PRIST, a dataset with 24k utterances and 8.3k multi-turn conversations for fine-grained targets, generated via a three-stage reasoning pipeline inspired by Tree-of-Thought. The MIRAS framework integrates dual visual encoders, semantic region alignment, and a segmentation prompt to enable pixel-grounded explanations and segmentation under progressive reasoning. It also establishes comprehensive evaluation metrics, including LLM-based reasoning scores, and demonstrates that MIRAS surpasses relevant baselines in both segmentation accuracy and reasoning quality, approaching human performance. Together, PRIST and MIRAS advance the field of pixel-level reasoning segmentation by combining rich conversational reasoning with precise grounding at the pixel level, enabling more faithful and interpretable visual understanding in interactive systems.

Abstract

Existing visual perception systems focus on region-level segmentation in single-turn dialogues, relying on complex and explicit query instructions. Such systems cannot reason at the pixel level and comprehend dynamic user intent that changes over interaction. Our work tackles this issue by introducing a novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on multi-turn conversations, tracking evolving user intent via multi-turn interactions for fine-grained segmentation. To establish a benchmark for this novel task, we build a Pixel-level ReasonIng Segmentation Dataset Based on Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k multi-turn conversational scenarios with segmentation targets. Building on PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning Segmentation framework, integrates pixel-level segmentation with robust multi-turn conversation understanding, generating pixel-grounded explanations aligned with user intent. The PRIST dataset and MIRSA framework fill the gap in pixel-level reasoning segmentation. Experimental results on the PRIST dataset demonstrate that our method outperforms current segmentation-specific baselines in terms of segmentation and LLM-based reasoning metrics. The code and data are available at: https://github.com/ccccai239/PixelRIST.

Paper Structure

This paper contains 47 sections, 10 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: RS vs. Pixel-level RS. Pixel-level RS refines intent understanding and segmentation (e.g., "oil bottle") through multi-turn interactions, while RS produces rough segmentation (e.g., "all ingredients") and handles implicit single-turn queries poorly.
  • Figure 2: The Generation Pipeline of PRIST Dataset. i) Step 1 extracts visible elements from images, establishing a semantic foundation for subsequent steps. ii) Step 2-1 generates complex reasoning questions from these elements, while Step 2-2 iteratively refines the questions through a reasoning tree, ensuring rigorous reasoning. iii) Step 3 organizes the nodes in reasoning tree into a multi-turn dialogue format.
  • Figure 3: The focus distribution of PRIST. We analyze focus objects across 3 dimensions: noun, adjective and preposition, which capture fine granularity, diversity, and close spatial relationships between objects.
  • Figure 4: Overview Architecture of MIRAS. The model integrates MLLM and SAM modules by introducing a special token [SEG]. MIRAS can perform both (a) Multi-turn Response and (b) Segmentation tasks end-to-end.
  • Figure 5: Wordcloud of the 200 popular focus-related words in PRIST.
  • ...and 5 more figures