RetoVLA: Reusing Register Tokens for Spatial Reasoning in Vision-Language-Action Models

Jiyeon Koo, Taewan Cho, Hyunjoon Kang, Eunseom Pyo, Tae Gyun Oh, Taeryang Kim, Andrew Jaeyong Choi

TL;DR

RetoVLA is an architecture that maintains spatial awareness in lightweight models by repurposing Register Tokens (learnable parameters originally introduced to mitigate attention artifacts in Vision Transformers) as a dense representation of global spatial context.

Abstract

Vision-Language-Action (VLA) models have demonstrated robust performance across diverse robotic tasks. However, their high memory and computational demands often limit real-time deployment. While existing model compression techniques reduce the parameter footprint, they often degrade 3D spatial reasoning and scene layout understanding. This work introduces RetoVLA, an architecture designed to maintain spatial awareness in lightweight models by repurposing Register Tokens, learnable parameters originally introduced to mitigate attention artifacts in Vision Transformers. While these tokens are generally discarded once used, we repurpose them for their dense representation of global spatial context. RetoVLA integrates these recycled tokens directly into the action-planning module through a dedicated spatial context injection path. Our proposed design enables the recovery of global context without increasing the total parameter count. Real-world experiments using a 7-DOF manipulator show a 17.1 percentage point (%p) improvement in average success rate over the baseline. Our results demonstrate that leveraging internal register tokens provides a highly effective mechanism for developing efficient, spatially-aware robotic agents. A video demonstration is available at: https://youtu.be/2CseBR-snZg
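The abstract describes the injection path only at a high level, so the following is a minimal sketch of one way such a path could work, not the authors' implementation. It assumes the action tokens attend over the register tokens with plain scaled dot-product attention (no learned projections, so no extra parameters) and that the result is added back through a scalar gate `g`, the symbol used in the Figure 4 caption. The names `attend` and `inject_registers`, the token counts, and the dimension are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs


def inject_registers(action_tokens, register_tokens, gate):
    """Assumed injection rule: a + g * Attn(a, R, R), applied per action token."""
    context = attend(action_tokens, register_tokens, register_tokens)
    return [
        [a + gate * c for a, c in zip(a_vec, c_vec)]
        for a_vec, c_vec in zip(action_tokens, context)
    ]


# Hypothetical shapes: 4 action tokens, 2 register tokens, dimension 8.
random.seed(0)
dim = 8
action_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4)]
register_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2)]

out_closed = inject_registers(action_tokens, register_tokens, gate=0.0)
out_open = inject_registers(action_tokens, register_tokens, gate=1.0)
```

With the gate closed (`gate=0.0`) the action tokens pass through unchanged, which is why a gated residual is a convenient form for ablations; with the gate open, each action token is shifted by a convex combination of the register tokens.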

Paper Structure

This paper contains 29 sections, 6 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Overview of the RetoVLA architecture. A dedicated spatial pathway (dashed arrow) injects global scene context by routing Register Tokens directly into the Action Expert. Unlike standard encoders that discard these tokens post-processing, RetoVLA repurposes them to seamlessly integrate spatial and semantic features, requiring no additional parameters.
  • Figure 2: Experimental setup overview. (Top) We use a custom robot arm for seven real-world manipulation tasks. Two tasks, 'Move Bowl' and 'Close Drawer', come directly from the LIBERO benchmark. (Bottom) We also implemented five additional tasks in simulation to match our real-world setup, alongside the two LIBERO tasks.
  • Figure 3: Comparison of RetoVLA and the SmolVLA baseline on challenging real-world tasks. (Top) RetoVLA (green) significantly outperforms the baseline (yellow). (Bottom) This gain comes from reusing Register Tokens (Darcet et al., 2023), indicated by the 'T' icon, to inject global spatial context into the Action Expert.
  • Figure 4: Causal Analysis of Register Tokens. (a) Consistent attention mass across outcomes indicates active utilization. (b) RetoVLA reduces attention on image patches, moving its attention to global context. (c) Gate value ($g$) directly changes the action output, establishing causality. (d) Randomized tokens degrade performance, confirming specific information encoding.
  • Figure 5: Reusing Register Tokens (Darcet et al., 2023) enables complex 3D spatial reasoning. The baseline SmolVLA (Shukor et al., 2025) grasps a visually similar but incorrect object, whereas RetoVLA correctly interprets the instruction "in the top drawer" by using the injected spatial context. This example highlights the model's ability to handle complex multi-step manipulation commands.
  • ...and 2 more figures
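The two interventions named in the Figure 4 caption, varying the gate value $g$ (panel c) and randomizing the register tokens (panel d), can be illustrated on a toy model. The sketch below is an assumption-laden stand-in, not the paper's evaluation code: `toy_policy` is a hypothetical action head that simply adds the gated mean of the register tokens to a state vector, and all sizes and values are made up for illustration.

```python
import random


def toy_policy(action_state, register_tokens, gate):
    """Hypothetical action head: state plus gated mean of the register tokens."""
    dim = len(action_state)
    pooled = [
        sum(tok[i] for tok in register_tokens) / len(register_tokens)
        for i in range(dim)
    ]
    return [a + gate * p for a, p in zip(action_state, pooled)]


def l2_distance(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5


def gate_sweep(action_state, register_tokens, gates):
    """Panel (c) style intervention: shift of the output relative to g = 0."""
    reference = toy_policy(action_state, register_tokens, 0.0)
    return [
        l2_distance(toy_policy(action_state, register_tokens, g), reference)
        for g in gates
    ]


def randomize_tokens(register_tokens, rng):
    """Panel (d) style intervention: swap each token for same-shape Gaussian noise."""
    return [[rng.gauss(0.0, 1.0) for _ in tok] for tok in register_tokens]


rng = random.Random(0)
state = [rng.gauss(0, 1) for _ in range(6)]
registers = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(4)]

shifts = gate_sweep(state, registers, [0.0, 0.5, 1.0])
noise_shift = l2_distance(
    toy_policy(state, registers, 1.0),
    toy_policy(state, randomize_tokens(registers, rng), 1.0),
)
```

In this toy setting the output shift grows in proportion to the gate, and replacing the tokens with noise moves the output even at a fixed gate. The real analysis measures task-level effects on a trained model, which this sketch does not attempt to reproduce.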