PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity

Yuqian Yuan; Wenqiao Zhang; Xin Li; Shihao Wang; Kehan Li; Wentong Li; Jun Xiao; Lei Zhang; Beng Chin Ooi

TL;DR

PixelRefer addresses fine-grained spatio-temporal object reasoning in images and videos. It introduces a Scale-Adaptive Object Tokenizer (SAOT) that produces compact, semantically rich object tokens, and a lightweight Object-Centric Infusion (OCI) module that pre-fuses global context into those tokens for an Object-Only framework. Guided by empirical analyses of LLM attention, the work offers two complementary variants: the full Vision-Object framework for rich context fusion, and PixelRefer-Lite, which prioritizes efficiency while maintaining high semantic fidelity. The new PixelRefer-2.2M dataset supports fine-grained alignment between language and both global context and local regions, while VideoRefer-700K adds video-specific supervision. Across image and video benchmarks, PixelRefer achieves state-of-the-art results with fewer training samples, and PixelRefer-Lite offers substantial runtime and memory savings, enabling practical deployment for real-world region-level reasoning.

Abstract

Multimodal large language models (MLLMs) have demonstrated strong general-purpose capabilities in open-world visual comprehension. However, most existing MLLMs primarily focus on holistic, scene-level understanding, often overlooking the need for fine-grained, object-centric reasoning. In this paper, we present PixelRefer, a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions across both images and videos. Motivated by the observation that LLM attention predominantly focuses on object-level tokens, we propose a Scale-Adaptive Object Tokenizer (SAOT) to generate compact and semantically rich object representations from free-form regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite, an efficient variant that employs an Object-Centric Infusion module to pre-fuse global context into object tokens. This yields a lightweight Object-Only Framework that substantially reduces computational cost while maintaining high semantic fidelity. To facilitate fine-grained instruction tuning, we curate PixelRefer-2.2M, a high-quality object-centric instruction dataset. Extensive experiments across a range of benchmarks validate that PixelRefer achieves leading performance with fewer training samples, while PixelRefer-Lite offers competitive accuracy with notable gains in efficiency.
