AnchorVLA4D: an Anchor-Based Spatial-Temporal Vision-Language-Action Model for Robotic Manipulation

Juan Zhu; Zhanying Shao; Xiaoqi Li; Ethan Morgan; Jiadong Xu; Hongwei Fan; Hao Dong

Abstract

Current Vision-Language-Action (VLA) systems suffer from limited spatial perception and lack memory during manipulation, so we investigate visual anchors as a means to enhance spatial and temporal reasoning in VLA policies for robotic manipulation. Conventional VLAs generate actions conditioned on a single current frame and a language instruction. Because that frame is encoded as a 2D image, it carries little detailed spatial information, and the VLA has no means of incorporating past context. As a result, it frequently forgets occluded objects and becomes spatially disoriented during manipulation. We therefore propose AnchorVLA4D, a simple spatial-temporal VLA that augments the visual input with an anchor image to preserve the initial scene context throughout execution, and adds a lightweight spatial encoder that jointly processes the anchor and current frames to expose geometric relationships within an episode. Built on a Qwen2.5-VL backbone with a diffusion-based action head, AnchorVLA4D requires no additional sensing modalities (e.g., depth or point clouds) and introduces negligible inference overhead. Combining anchoring with a frozen pretrained spatial encoder yields further gains, reaching a 13.6% improvement on the Simpler WidowX benchmark. Real-world experiments confirm the approach, with an average success rate of 80%.
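
The anchoring idea described above can be illustrated with a minimal sketch. It assumes the anchor is simply the first frame of an episode, held fixed and paired with every later frame before both are passed to the policy. The class name `AnchorBuffer` and its methods are hypothetical and are not taken from the paper; the spatial encoder, the Qwen2.5-VL backbone, and the diffusion action head are out of scope here.

```python
import copy
from typing import Any, Optional, Tuple


class AnchorBuffer:
    """Hypothetical helper that pairs an episode's first frame with the current frame."""

    def __init__(self) -> None:
        # No anchor exists until the first frame of an episode arrives.
        self._anchor: Optional[Any] = None

    def reset(self) -> None:
        """Clear the anchor at the start of a new episode."""
        self._anchor = None

    def step(self, frame: Any) -> Tuple[Any, Any]:
        """Return (anchor, current) for this timestep.

        The first frame seen after a reset becomes the anchor. It is deep-copied
        so that later in-place changes to the frame cannot alter the stored
        initial scene.
        """
        if self._anchor is None:
            self._anchor = copy.deepcopy(frame)
        return self._anchor, frame
```

In this sketch a policy would receive both elements of the returned pair at every step, so an object that is occluded in the current frame remains visible in the anchor. How the two images are actually fused is left to the model and is not shown.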

Paper Structure (20 sections, 1 equation, 5 figures, 5 tables)

INTRODUCTION
RELATED WORKS
METHODOLOGY
Preliminaries
Model Architecture
AnchorVLA4D Workflow
Training Recipe
EXPERIMENTS
Simulation Experiments
Early Retrying
Improved Spatial Awareness
Real World Experiments
ABLATION STUDIES
Performance Breakdown
Performance Gains from Adding Anchor
...and 5 more sections

Figures (5)

Figure 1: Limits of depending on a single frame in conventional VLAs. The top row illustrates forgetting caused by occlusion, while the bottom row demonstrates spatial disorientation.
Figure 2: Overall Architecture
Figure 3: More precise retries using an anchor
Figure 4: Tasks in Real-World Environments
Figure 5: Success rates for three different tasks
