PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model

Wenqi Liang, Gan Sun, Yao He, Jiahua Dong, Suyan Dai, Ivan Laptev, Salman Khan, Yang Cong

TL;DR

PixelVLA is introduced as the first VLA model designed to support both pixel-level reasoning and multimodal prompting with text and visual inputs. It is built on a new visuomotor instruction tuning framework that integrates a multiscale pixel-aware encoder with a visual prompting encoder.

Abstract

Vision-Language-Action models (VLAs) are emerging as powerful tools for learning generalizable visuomotor control policies. However, current VLAs are mostly trained on large-scale image-text-action data and remain limited in two key ways: (i) they struggle with pixel-level scene understanding, and (ii) they rely heavily on textual prompts, which reduces their flexibility in real-world settings. To address these challenges, we introduce PixelVLA, the first VLA model designed to support both pixel-level reasoning and multimodal prompting with text and visual inputs. Our approach is built on a new visuomotor instruction tuning framework that integrates a multiscale pixel-aware encoder with a visual prompting encoder. To train PixelVLA effectively, we further propose a two-stage automated annotation pipeline that generates Pixel-160K, a large-scale dataset with pixel-level annotations derived from existing robot data. Experiments on three standard VLA benchmarks and two VLA model variants show that PixelVLA improves manipulation success rates by 10.1%-17.8% over OpenVLA, while requiring only 1.5% of its pretraining cost. These results demonstrate that PixelVLA can be integrated into existing VLAs to enable more accurate, efficient, and versatile robot control in complex environments. The dataset and code will be released as open source.

Paper Structure

This paper contains 18 sections, 4 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: We introduce PixelVLA, a vision–language–action (VLA) model designed for pixel-level reasoning and multimodal prompting. Unlike prior VLA models (a), which primarily rely on image-level understanding for manipulation and depend solely on textual instructions, PixelVLA (b) advances beyond these limitations by enabling fine-grained pixel-level comprehension and supporting both textual and visual prompts. This paradigm effectively enhances spatial precision and expands human–robot interaction, leading to superior performance (c) compared to baseline methods.
  • Figure 2: Overview of the PixelVLA architecture. The model integrates three novel components: (1) a visual prompting encoder for processing diverse visual prompts; (2) a multiscale pixel-aware encoder that injects pixel-level information into token embeddings; and (3) a continuous action decoder to predict 7D robot actions. PixelVLA enhances fine-grained pixel-level spatial understanding and multimodal prompt responsiveness, enabling more precise manipulation policies in visually complex scenarios.
  • Figure 3: Overview of the Pixel-160K Dataset.
  • Figure 4: Performance comparison of OpenVLA, TraceVLA, and PixelVLA across various environmental variations on the SimplerEnv-Google Robot setup: camera orientations, lighting, background, distractors, and table texture.
  • Figure 5: An episode example from our Pixel-160K dataset.
  • ...and 2 more figures