PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity

Yuqian Yuan, Wenqiao Zhang, Xin Li, Shihao Wang, Kehan Li, Wentong Li, Jun Xiao, Lei Zhang, Beng Chin Ooi

TL;DR

PixelRefer tackles the need for fine-grained spatiotemporal object reasoning in images and videos by introducing a Scale-Adaptive Object Tokenizer (SAOT) to produce compact, semantically rich object tokens and a lightweight Object-Centric Infusion (OCI) module for pre-fusing global context in an Object-Only framework. Guided by empirical analyses of LLM attention, PixelRefer presents two complementary paths: a full Vision-Object framework for rich context fusion and a PixelRefer-Lite variant that prioritizes efficiency without sacrificing semantic fidelity. A new PixelRefer-2.2M dataset supports fine-grained alignment between language and both global context and local regions, while VideoRefer-700K augments video-specific supervision. Across image and video benchmarks, PixelRefer achieves state-of-the-art results with fewer training samples, and PixelRefer-Lite offers substantial runtime and memory savings, enabling practical deployment for real-world region-level reasoning.

Abstract

Multimodal large language models (MLLMs) have demonstrated strong general-purpose capabilities in open-world visual comprehension. However, most existing MLLMs primarily focus on holistic, scene-level understanding, often overlooking the need for fine-grained, object-centric reasoning. In this paper, we present PixelRefer, a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions across both images and videos. Motivated by the observation that LLM attention predominantly focuses on object-level tokens, we propose a Scale-Adaptive Object Tokenizer (SAOT) to generate compact and semantically rich object representations from free-form regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite, an efficient variant that employs an Object-Centric Infusion module to pre-fuse global context into object tokens. This yields a lightweight Object-Only Framework that substantially reduces computational cost while maintaining high semantic fidelity. To facilitate fine-grained instruction tuning, we curate PixelRefer-2.2M, a high-quality object-centric instruction dataset. Extensive experiments across a range of benchmarks validate that PixelRefer achieves leading performance with fewer training samples, while PixelRefer-Lite offers competitive accuracy with notable gains in efficiency.

Paper Structure

This paper contains 25 sections, 10 equations, 13 figures, 10 tables, 1 algorithm.

Figures (13)

  • Figure 1: PixelRefer, a unified region-level MLLM, supports a broad range of tasks at both the object level and the scene level, spanning spatial (image) and temporal (video) domains. It enables fine-grained spatiotemporal reasoning over user-specified regions with arbitrary semantic granularity, while preserving general-purpose capabilities for holistic visual understanding.
  • Figure 2: Quantitative Evaluation and Efficiency Analysis. (a) Performance Comparison: PixelRefer and PixelRefer-Lite consistently outperform state-of-the-art object-level MLLMs across diverse image (LVIS [yuan2024osprey], PACO [yuan2024osprey], DLC-Bench [lian2025describe]) and video (VideoRefer-Bench, HC-STVG [tang2021human]) benchmarks. (b) Data Efficiency: Our method achieves leading performance with fewer training samples than existing methods. (c) Runtime and Memory Efficiency: PixelRefer-Lite notably reduces inference time and memory usage.
  • Figure 3: Visualization of attention maps across different layers (Layers 1, 7, 14, and 28) of the LLM. The input sequence includes system tokens (sys), global image tokens (vision), text prompts (text), object-level tokens (object), and answer tokens (ans). For clarity, image tokens are average-pooled by a factor of 8. The figure shows how attention over the different token types evolves across layers.
  • Figure 4: Visualization of answer-to-image attention heatmaps for different query regions. The model adaptively focuses on relevant objects while incorporating contextual cues from the surrounding areas.
  • Figure 5: The two complementary paradigms for region-level representation in our approach: (a) the Vision-Object Framework and (b) the Object-Only Framework.
  • ...and 8 more figures
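
The Figure 3 caption notes that image tokens are average-pooled by a factor of 8 before the attention maps are drawn. The sketch below illustrates one way such pooling could be done on a single attention row. It is not the authors' code: the function names, the 1-D grouping of consecutive tokens, and the token-span indices are all assumptions made for illustration.

```python
from typing import List


def average_pool_1d(values: List[float], factor: int = 8) -> List[float]:
    """Average consecutive groups of `factor` values.

    The final group may be shorter if len(values) is not a multiple of factor.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    pooled = []
    for start in range(0, len(values), factor):
        group = values[start:start + factor]
        pooled.append(sum(group) / len(group))
    return pooled


def pool_image_columns(
    attn_row: List[float], image_start: int, image_end: int, factor: int = 8
) -> List[float]:
    """Pool only the image-token span [image_start, image_end) of one attention row.

    Tokens outside the span (hypothetically: sys, text, object, ans) are kept as-is.
    """
    return (
        attn_row[:image_start]
        + average_pool_1d(attn_row[image_start:image_end], factor)
        + attn_row[image_end:]
    )


# Toy row: 1 system token, 16 image tokens, 1 object token (made-up values).
row = [0.5] + [0.25] * 16 + [1.0]
pooled = pool_image_columns(row, image_start=1, image_end=17, factor=8)
print(pooled)
```

Pooling only the image span keeps the much shorter system, text, object, and answer segments at full resolution, so they remain visible next to the far longer run of image tokens.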