SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs

Mohamad Alansari; Naufal Suryanto; Divya Velayudhan; Sajid Javed; Naoufel Werghi; Muzammal Naseer

Abstract

Multimodal large language models (MLLMs) have advanced from image-level reasoning to pixel-level grounding, but extending these capabilities to videos remains challenging as models must achieve spatial precision and temporally consistent reference tracking. Existing video MLLMs often rely on a static segmentation token ([SEG]) for frame-wise grounding, which provides semantics but lacks temporal context, causing spatial drift, identity switches, and unstable initialization when objects move or reappear. We introduce SPARROW, a pixel-grounded video MLLM that unifies spatial accuracy and temporal stability through two key components: (i) Target-Specific Tracked Features (TSF), which inject temporally aligned referent cues during training, and (ii) a dual-prompt design that decodes box ([BOX]) and segmentation ([SEG]) tokens to fuse geometric priors with semantic grounding. SPARROW is supported by a curated referential video dataset of 30,646 videos and 45,231 Q&A pairs and operates end-to-end without external detectors via a class-agnostic SAM2-based proposer. Integrated into three recent open-source video MLLMs (UniPixel, GLUS, and VideoGLaMM), SPARROW delivers consistent gains across six benchmarks, improving up to +8.9 J&F on RVOS, +5 mIoU on visual grounding, and +5.4 CLAIR on GCG. These results demonstrate that SPARROW substantially improves referential stability, spatial precision, and temporal coherence in pixel-grounded video understanding. Project page: https://risys-lab.github.io/SPARROW

Paper Structure (68 sections, 13 equations, 18 figures, 15 tables)

Introduction
Related Work
From Language Models to Grounded MLLMs
Language and Multimodal Foundations.
Grounded and Referring Understanding in Images.
Video MLLMs for Grounded Segmentation
Video MLLMs.
Our Positioning.
Methodology
Framework Overview
Architecture.
Visual Feature Encoding.
Target-Specific Features
Motivation.
Tracking and Selection.
...and 53 more sections

Figures (18)

Figure 1: Comparison of temporal consistency and initialization quality in video object segmentation. (a) The baseline method, VideoGLaMM, suffers from temporal drift, leading to inconsistent segmentation of the same object across frames. (b) Noisy or unstable initialization propagates segmentation errors through subsequent frames. (c) Our proposed Target-Specific Tracked Feature mitigates drift by maintaining consistent object grounding over time. (d) The Dual-Prompt Initialization strategy improves segmentation precision and stability during early frames.
Figure 2: SPARROW pipeline. Given a video and text prompt, spatial (CLIP) and temporal (InternVL2) encoders feed V$\to$L adapters and a LoRA-tuned LLM. The LLM emits [BOX] and [SEG] tokens, which are projected (L$\to$V) to condition a class-agnostic proposer and the SAM2 pixel decoder. Dashed green modules (GroundingDINO, CLDTracker, target cropping, K-means) are pre-computed offline as pseudo-supervision, used only for the target-specific information injection step, and are removed at test time by default.
Figure 3: Illustrative process of the dual-prompt initialization. Given a query (e.g., "Can you segment the truck?"), our module first generates class-agnostic proposals, which are filtered by the [BOX] prompt and then refined by the [SEG] prompt to produce precise segmentation masks. The dual-prompt approach provides tighter localization and sharper boundaries compared with using [SEG] only.
Figure 4: Qualitative comparison on the Referring Video Object Segmentation (RVOS) task. (Left) Dog scene: SPARROW accurately segments “dog … to the right,” “dog … to the left,” and “woman in a yellow jacket” from the first frame, preserving their identities across motion and overlap. (Right) Crowd scene: SPARROW cleanly separates “the woman in yellow,” “the woman in red,” and “the woman in black,” maintaining consistent masks and sharp boundaries under occlusion and scale variation. In contrast, UniPixel and GLUS exhibit early inaccuracies, mask swaps, and boundary artifacts across both scenes. The dual-prompt initialization enables precise first-frame grounding and referentially consistent segmentation, producing temporally stable masks throughout the sequence.
Figure 5: Visual grounding on VidSTG (interrogative). SPARROW boosts mIoU by +5.49 on UniPixel (41.25$\rightarrow$46.74), +5.25 on GLUS (29.92$\rightarrow$35.17), and +5.40 on VideoGLaMM (39.66$\rightarrow$45.06), consistently improving spatial and mask quality.
...and 13 more figures
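
The dual-prompt initialization described in the Figure 3 caption (class-agnostic proposals, filtered by the [BOX] prompt, then refined by the [SEG] prompt) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Proposal` type, the `seg_score` field (a stand-in for how well a proposal agrees with the [SEG] prompt), the function names, and the 0.5 IoU threshold are all hypothetical choices made for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


class Proposal(NamedTuple):
    box: Box
    seg_score: float  # hypothetical stand-in for agreement with the [SEG] prompt


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def dual_prompt_select(
    proposals: List[Proposal], box_prompt: Box, iou_thresh: float = 0.5
) -> Optional[Proposal]:
    """Keep proposals that overlap the box prompt, then pick the best by seg_score."""
    # Step 1 (geometric prior): discard proposals far from the predicted box.
    kept = [p for p in proposals if box_iou(p.box, box_prompt) >= iou_thresh]
    if not kept:
        return None
    # Step 2 (semantic grounding): choose the survivor that best matches [SEG].
    return max(kept, key=lambda p: p.seg_score)


proposals = [
    Proposal((0, 0, 10, 10), 0.9),      # high semantic score, wrong location
    Proposal((50, 50, 100, 100), 0.6),  # right location, weaker score
    Proposal((48, 52, 98, 102), 0.8),   # right location, stronger score
]
selected = dual_prompt_select(proposals, box_prompt=(50, 50, 100, 100))
```

In this toy input, the first proposal has the highest semantic score but lies outside the box prompt, so the geometric filter removes it before the semantic ranking is applied. This mirrors the caption's claim that combining both prompts gives tighter localization than using [SEG] alone.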
