Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception

Jiashu Yang, Yifan Han, Yucheng Xie, Ning Guo, Wenzhao Lian

TL;DR

EyeVLA tackles the bottleneck in embodied perception by turning language-directed instructions into active camera control within a single autoregressive framework that fuses vision, language, and action tokens. It introduces a hierarchical, discrete action vocabulary and bounding-box feedback to drive pan-tilt-zoom adjustments under pixel and spatial budgets, enabling transfer of open-world VLM semantics to actionable viewpoints with only 500 real-world samples plus synthetic data. The approach delivers strong open-world perception capabilities, validated through real-world experiments and qualitative visualizations, and shows improved task success as the system actively acquires informative evidence. While effective, it faces computational and hardware constraints of large VLMs and camera systems, with future work aimed at lightweight models, hardware acceleration, and multi-step, instruction-driven perception.

Abstract

In embodied AI perception systems, visual perception should be active: the goal is not to passively process static images, but to actively acquire more informative data within pixel and spatial budget constraints. Existing vision models and fixed RGB-D camera systems fundamentally fail to reconcile wide-area coverage with fine-grained detail acquisition, severely limiting their efficacy in open-world robotic applications. To address this issue, we propose EyeVLA, a robotic eyeball for active visual perception that can take proactive actions based on instructions, enabling clear observation of fine-grained target objects and detailed information across a wide spatial extent. EyeVLA discretizes action behaviors into action tokens and integrates them with vision-language models (VLMs) that possess strong open-world understanding capabilities, enabling joint modeling of vision, language, and actions within a single autoregressive sequence. By using 2D bounding-box coordinates to guide the reasoning chain and applying reinforcement learning to refine the viewpoint selection policy, we transfer the open-world scene understanding capability of the VLM to a vision-language-action (VLA) policy using only minimal real-world data. Experiments show that our system efficiently follows instructions in real-world scenes and actively acquires more accurate visual information through instruction-driven rotation and zoom actions, thereby achieving strong environmental perception capabilities. EyeVLA introduces a novel robotic vision system that leverages detailed and spatially rich, large-scale embodied data, and actively acquires highly informative visual observations for downstream embodied tasks.
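As an illustration of what discretizing camera motion into hierarchical action tokens could look like, the following is a minimal sketch and not the paper's implementation. The `AxisCodec` class, the token naming scheme, the coarse/fine split, and the step sizes (10 and 1 degrees, clamped to 90) are all assumptions chosen for the example. The idea is that a continuous pan, tilt, or zoom command is split into one coarse token and one fine token per axis, which keeps the added vocabulary small while preserving resolution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisCodec:
    """Hypothetical two-level (coarse + fine) tokenizer for one camera axis."""

    name: str           # axis label used in token strings, e.g. "pan"
    coarse_step: float  # units covered by one coarse token (assumed value)
    fine_step: float    # units covered by one fine token (assumed value)
    max_abs: float      # commands are clamped to [-max_abs, +max_abs]

    def encode(self, value: float) -> list[str]:
        """Map a continuous command to [coarse_token, fine_token]."""
        clamped = max(-self.max_abs, min(self.max_abs, value))
        total_fine = round(clamped / self.fine_step)
        fine_per_coarse = round(self.coarse_step / self.fine_step)
        sign = -1 if total_fine < 0 else 1
        coarse, fine = divmod(abs(total_fine), fine_per_coarse)
        return [
            f"<{self.name}_coarse_{sign * coarse:+d}>",
            f"<{self.name}_fine_{sign * fine:+d}>",
        ]

    def decode(self, tokens: list[str]) -> float:
        """Invert encode(): sum the coarse and fine contributions."""
        coarse_tok, fine_tok = tokens
        coarse = int(coarse_tok.strip("<>").rsplit("_", 1)[1])
        fine = int(fine_tok.strip("<>").rsplit("_", 1)[1])
        return coarse * self.coarse_step + fine * self.fine_step


# Assumed step sizes: 10 degrees per coarse token, 1 degree per fine token.
pan = AxisCodec(name="pan", coarse_step=10.0, fine_step=1.0, max_abs=90.0)
pan_tokens = pan.encode(23.4)
pan_restored = pan.decode(pan_tokens)
```

With these assumed steps, the full range of one axis needs 19 coarse and 19 fine token strings instead of 181 flat ones, at the cost of emitting two tokens per axis. The token strings would still have to be registered in the model vocabulary, which this sketch does not cover.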

Paper Structure

This paper contains 15 sections, 7 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: (a): Existing vision systems with fixed RGB-D cameras cannot handle fine-grained visual information across larger spatial extents. (b): Our EyeVLA system can perceive broader and finer-grained visual information from a fixed position by rotating its viewpoint and zooming in on the target, according to instructions.
  • Figure 2: Overview of the EyeVLA pipeline. The system is built upon the Qwen2.5-VL framework, integrating visual perception, language understanding, and action generation capabilities. To preserve the original semantic alignment during training, the parameters of the ViT and its projector module are kept frozen. Additionally, we introduce action tokens into the vocabulary to represent camera motions. To efficiently represent robotic actions, we further adopt a hierarchical encoding strategy to structurally model the action space.
  • Figure 3: Inference results of models trained on synthetic data generated under different strategies and iteration counts, along with their performance comparison in real-world scenarios. The figure illustrates a scenario where the goal is to identify the brand of a pen inside a box. A conventional camera, constrained by its fixed viewpoint, cannot extend into the box via a robotic arm to capture fine details. In contrast, our EyeVLA system enables a clear view of the target by dynamically adjusting the camera pose and zooming in.
  • Figure 4: Flowchart of a Data Generator.
  • Figure 5: Comparison of Results from Three-Stage SFT
  • ...and 2 more figures