Pixelis: Reasoning in Pixels, from Seeing to Acting

Yunpeng Zhou

Abstract

Most vision-language systems are static observers: they describe pixels, do not act, and cannot safely improve under shift. This passivity limits generalizable, physically grounded visual intelligence. Learning through action, rather than static description, is essential for generalizing beyond curated data. We present Pixelis, a pixel-space agent that operates directly on images and videos via a compact set of executable operations (zoom/crop, segment, track, OCR, temporal localization) and learns from the consequences of those actions. Pixelis trains in three phases: (1) Supervised Fine-Tuning learns a pixel-tool grammar from Chain-of-Thought-Action traces with a masked imitation loss that upweights operation/argument tokens and auxiliary heads to stabilize pixel-grounded arguments; (2) Curiosity-Coherence Reward Fine-Tuning optimizes a dual-drive objective marrying prediction-error curiosity with adjacent-step coherence and a mild efficiency prior under a KL anchor, yielding short, valid, structured toolchains; (3) Pixel Test-Time RL performs label-free adaptation by retrieving neighbors, voting over complete trajectories rather than answers, and updating toward short, high-fidelity exemplars while constraining drift with a KL-to-EMA safety control. Across six public image and video benchmarks, Pixelis yields consistent improvements: the average relative gain, computed as (ours-baseline)/baseline, is +4.08% over the same 8B baseline (peaking at +6.03% on VSI-Bench), while producing shorter, auditable toolchains and maintaining in-corridor KL during test-time learning. Acting within pixels, rather than abstract tokens, grounds multimodal perception in the physical world, links visual reasoning to actionable outcomes, and enables embodied adaptation without external feedback.
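The abstract's test-time phase votes over complete trajectories rather than final answers. A minimal sketch of what trajectory-level voting among retrieved neighbors could look like (the data layout, tie-breaking rule, and all names here are illustrative assumptions, not the paper's implementation):

```python
from collections import Counter

def vote_over_trajectories(trajectories):
    """Pick the consensus trajectory among retrieved neighbors.

    Each trajectory is a tuple of (operation, argument-summary) steps ending
    in an answer step; counting whole tuples rewards behaviorally consistent
    toolchains, not merely agreeing answers. Ties are broken toward shorter
    trajectories, echoing the preference for short toolchains.
    Illustrative sketch only.
    """
    counts = Counter(trajectories)
    best = max(counts.items(), key=lambda kv: (kv[1], -len(kv[0])))
    return best[0]

# Two neighbors share the same short toolchain; a third agrees on the
# answer but reached it by a different, longer route.
t1 = (("zoom", "region A"), ("ocr", "region A"), ("answer", "42"))
t2 = (("zoom", "region A"), ("ocr", "region A"), ("answer", "42"))
t3 = (("segment", "all"), ("track", "obj1"),
      ("ocr", "region B"), ("answer", "42"))
print(vote_over_trajectories([t1, t2, t3]))  # the shared toolchain wins
```

Answer-level voting would treat all three as agreeing; trajectory-level voting additionally demands that the evidence-gathering behavior itself be consistent.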


Paper Structure

This paper contains 33 sections, 31 equations, 16 figures, and 18 tables.

Figures (16)

  • Figure 1: Pixelis uses executable pixel tools to act on images and videos. SFT learns tool syntax; CC-RFT shapes exploration with curiosity, coherence, and efficiency; Pixel TTRL adapts online via retrieval and trajectory voting under KL/EMA safety, producing shorter, auditable toolchains.
  • Figure 2: Three-phase training of Pixelis. SFT learns a tool-use grammar; CC-RFT shapes exploration with curiosity, coherence, and a light efficiency prior; Pixel TTRL adapts at test time by retrieving neighbors and updating toward behaviorally consistent trajectories under a KL-to-EMA safety constraint, turning raw tool traces into shorter, structured pixel toolchains.
  • Figure 3: RFT process metrics: RaPR (top) and RaCPR (bottom). We compare Answer Only, +Curiosity, +Coherence, +Curiosity+Coherence, and +Curiosity+Coherence+Penalty (Pixelis). Adding curiosity alone increases RaPR but hurts RaCPR; adding coherence and a light penalty yields the highest RaPR/RaCPR with lower variance across seeds. Bars show means over 3 seeds with 95% BCa bootstrap confidence intervals (BH-corrected).
  • Figure 4: Pixel TTRL: accuracy (top) and token-KL to EMA (bottom) within the corridor $[0.10,0.20]$. The safe variant (value-aware retrieval, trajectory voting, EMA+KL) stays in-corridor; the no-safety variant drifts and degrades. A PID-controlled $\beta$ keeps KL bounded.
  • Figure 5: Qualitative comparison. The baseline often loops or over-zooms on irrelevant regions, while Pixelis forms shorter, more coherent toolchains that align with the queried evidence, reflected in higher RaPR/RaCPR and VisFid.
  • ...and 11 more figures
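Figure 4 describes a PID-controlled $\beta$ that keeps the token-KL to the EMA anchor inside a target corridor. A minimal sketch of how such a coefficient controller might work (the gains, corridor midpoint, bounds, and class name are illustrative assumptions, not values from the paper):

```python
class KLCoefficientPID:
    """Adjust the KL-penalty coefficient beta so the measured token-KL
    to the EMA anchor tracks a target inside the safety corridor.
    Gains and bounds are illustrative, not taken from the paper."""

    def __init__(self, target_kl=0.15, kp=0.5, ki=0.1, kd=0.05,
                 beta_init=1.0, beta_min=0.01, beta_max=10.0):
        self.target = target_kl
        self.kp, self.ki, self.kd = kp, ki, kd
        self.beta = beta_init
        self.beta_min, self.beta_max = beta_min, beta_max
        self.integral = 0.0
        self.prev_error = None

    def update(self, measured_kl):
        # Positive error: KL above target, so raise beta to pull the
        # policy back toward the EMA anchor; negative error relaxes it.
        error = measured_kl - self.target
        self.integral += error
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        adjustment = (self.kp * error
                      + self.ki * self.integral
                      + self.kd * derivative)
        self.beta = min(max(self.beta + adjustment, self.beta_min),
                        self.beta_max)
        return self.beta

ctrl = KLCoefficientPID()
print(ctrl.update(0.30))  # KL above the corridor: beta increases
print(KLCoefficientPID().update(0.05))  # KL below target: beta relaxes
```

The integral term removes steady-state drift while the clamp on `beta` prevents the penalty from collapsing to zero or exploding, which is one plausible way to keep the KL trace "in-corridor" as the bottom panel of Figure 4 shows.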