UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning

Ye Liu, Zongyang Ma, Junfu Pu, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen

TL;DR

UniPixel addresses the gap in fine-grained pixel-level visual reasoning by unifying object referring and segmentation within a single large multimodal model. It introduces an object memory bank, a prompt encoder for sparse and dense prompts, and a SAM 2.1–based mask decoder, all integrated with an LLM and trained via a three-stage alignment pipeline. The approach yields state-of-the-art results across 10 pixel-level benchmarks, including challenging video tasks. It also introduces PixelQA, a task that jointly requires referring, segmentation, and QA in videos. The memory-unified framework demonstrates mutual reinforcement between referring and segmentation and enables mask-grounded reasoning with flexible visual prompts, which suggests practical value for fine-grained visual understanding in real-world applications.

Abstract

Recent advances in Large Multi-modal Models (LMMs) have demonstrated their remarkable success as general-purpose multi-modal assistants, with a particular focus on holistic image- and video-language understanding. In contrast, less attention has been given to scaling fine-grained pixel-level understanding capabilities, where models are expected to achieve pixel-level alignment between visual signals and language semantics. Some previous studies have applied LMMs to related tasks such as region-level captioning and referring expression segmentation. However, these models are limited to performing either referring or segmentation independently, and they fail to integrate these fine-grained perception capabilities into visual reasoning. To bridge this gap, we propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Our model distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding capabilities. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, then performs subsequent reasoning conditioned on these intermediate pointers during inference, thereby enabling fine-grained pixel-level reasoning. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos. A novel PixelQA task that jointly requires referring, segmentation, and question answering is also designed to verify the flexibility of our method.
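The flow described in the abstract (take a visual prompt, produce a mask, then reason with that mask as an intermediate pointer) can be illustrated with a toy sketch. This is not the authors' implementation: `ObjectMemoryBank`, `fake_mask_decoder`, and the `<obj1>` placeholder format are hypothetical names chosen for illustration, and the decoder is a stub that marks only the clicked pixel.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectMemoryBank:
    """Hypothetical store: object id -> list of per-frame binary masks."""
    masks: dict = field(default_factory=dict)

    def update(self, per_frame_masks):
        # Assign the next free id and remember the object's masks.
        obj_id = len(self.masks) + 1
        self.masks[obj_id] = per_frame_masks
        return obj_id

    def inject(self, question):
        # Prepend one placeholder token per stored object (assumed format).
        refs = " ".join(f"<obj{i}>" for i in sorted(self.masks))
        return f"{refs} {question}" if refs else question


def fake_mask_decoder(point, num_frames, height, width):
    """Stub standing in for a real mask decoder: marks only the clicked pixel."""
    x, y = point
    frame = [[1 if (r == y and c == x) else 0 for c in range(width)]
             for r in range(height)]
    return [frame for _ in range(num_frames)]


bank = ObjectMemoryBank()
obj_id = bank.update(fake_mask_decoder((2, 1), num_frames=3, height=4, width=4))
prompt = bank.inject("What is this object doing?")
print(prompt)
```

In this sketch the language model would receive `prompt` together with the stored masks; how UniPixel actually represents the stored objects inside the prompt is not specified in the text above.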

Paper Structure

This paper contains 25 sections, 11 figures, 17 tables.

Figures (11)

  • Figure 1: UniPixel flexibly supports a large variety of fine-grained image and video understanding tasks, including referring/reasoning/interactive segmentation, motion-grounded video reasoning, and referred video description & question answering. It can also handle a novel PixelQA task that jointly requires object-centric referring, segmentation, and question answering in videos.
  • Figure 2: Schematic comparison between UniPixel and its counterparts. To the best of our knowledge, UniPixel is the first unified method supporting simultaneous object referring and segmentation.
  • Figure 3: The architecture of UniPixel. Given a video, a question, and visual prompts, the model encodes them into tokens via the visual encoder, prompt encoder, and tokenizer, respectively, then predicts a spatial-temporal mask for each visual prompt via the mask decoder. The masks are updated into the object memory bank, and subsequently injected into the prompt for pixel-level reasoning.
  • Figure 4: Joint positional & temporal encoding for point ($X_1 Y_1 T$) and box ($X_1 Y_1 X_2 Y_2 T$) prompts.
  • Figure 5: Visualization of the outputs from UniPixel on the PixelQA task. Star marks and boxes refer to point and box prompts, respectively. The boxed frames denote where the visual prompts are applied. Given different types of visual prompts on a single frame, our method can flexibly infer the relevant object, track it across the entire video, and involve its features in reasoning.
  • ...and 6 more figures
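
Figure 4 refers to a joint positional and temporal encoding in which a point prompt carries $(X_1, Y_1, T)$ and a box prompt carries $(X_1, Y_1, X_2, Y_2, T)$. The captions do not say which encoding function is used, so the sketch below assumes a simple sinusoidal encoding per normalized coordinate, concatenated into one vector. The function names, the feature dimension, and the normalization are all illustrative assumptions.

```python
import math


def sinusoidal(value, dim=8):
    """Encode a scalar in [0, 1] as `dim` sine/cosine features."""
    feats = []
    for k in range(dim // 2):
        freq = (2 ** k) * math.pi
        feats.append(math.sin(freq * value))
        feats.append(math.cos(freq * value))
    return feats


def encode_prompt(coords, frame_idx, width, height, num_frames, dim=8):
    """Encode a point (x, y) or box (x1, y1, x2, y2) plus its frame index."""
    if len(coords) not in (2, 4):
        raise ValueError("expected a point (2 values) or a box (4 values)")
    # Normalize: even positions are x-coordinates, odd positions are y.
    normalized = [c / (width if i % 2 == 0 else height)
                  for i, c in enumerate(coords)]
    # Append the normalized time step T.
    normalized.append(frame_idx / max(num_frames - 1, 1))
    embedding = []
    for v in normalized:
        embedding.extend(sinusoidal(v, dim))
    return embedding


point_emb = encode_prompt((320, 240), frame_idx=0,
                          width=640, height=480, num_frames=16)
box_emb = encode_prompt((100, 50, 300, 200), frame_idx=8,
                        width=640, height=480, num_frames=16)
print(len(point_emb), len(box_emb))
```

Under these assumptions a point yields three encoded scalars and a box yields five, so the two prompt types differ only in vector length before any learned projection is applied.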