SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding

Nikolay Nikolov, Giuliano Albanese, Sombit Dey, Aleksandar Yanev, Luc Van Gool, Jan-Nico Zaech, Danda Pani Paudel

TL;DR

This paper tackles the failure of generalist robot policies to generalize to unseen environments and across embodiments, which it attributes to their reliance on 2D Vision-Language Models. It introduces SPEAR-VLM, a 3D-aware VLM built on PaliGemma and augmented with a monocular depth encoder, trained on 3D visual question answering (VQA) tasks using non-robotic data (≈$230$k annotated images) and 3D tokens ($N=1024$). Building on this, SPEAR-1 is a robotic foundation model with a Flow Matching action expert that integrates explicit 3D spatial reasoning into end-to-end control. It is trained on ≈$45$M frames across $24$ Open X-Embodiment datasets and achieves state-of-the-art zero-shot results on Franka (DROID) and WidowX while using ≈$20\times$ fewer demonstrations. The results, including comprehensive ablations and real-world experiments, support the claim that 3D pretraining on non-robotic data can scale robotic generalization, and the authors provide open weights and 3D-annotated datasets to accelerate future work.
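
The TL;DR names a Flow Matching action expert but this summary does not say which formulation is used. As a rough illustration only, the sketch below shows one common variant (a straight-line path from Gaussian noise to the target action, with a constant velocity target). The function names, the pure-Python list representation of an action, and the Euler sampler are all assumptions made for this example, not the authors' implementation.

```python
import random

def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]

def target_velocity(noise, action):
    """Velocity of the straight path; it does not depend on t."""
    return [a - n for n, a in zip(noise, action)]

def flow_matching_loss(predict_velocity, action, rng):
    """Mean squared error between predicted and target velocity at a random t.

    `predict_velocity(x_t, t)` stands in for the learned action expert
    (hypothetical interface, assumed for this sketch).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()
    x_t = interpolate(noise, action, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(noise, action)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(action)

def sample_action(predict_velocity, noise, steps=10):
    """Euler integration of dx/dt = v(x, t) from t=0 to t=1."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy check with an oracle velocity field that always points at one action.
goal = [0.3, -0.1, 0.5]

def oracle(x, t):
    return [(g - xi) / (1.0 - t) for g, xi in zip(goal, x)]

rng = random.Random(0)
print(flow_matching_loss(oracle, goal, rng))
print(sample_action(oracle, [rng.gauss(0.0, 1.0) for _ in goal]))
```

In a real policy the oracle would be replaced by a network conditioned on the VLM features, and the loss would be averaged over a batch of demonstrations.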

Abstract

Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control. Yet their ability to generalize across new environments, tasks, and embodiments remains limited. We argue that a major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs). However, these VLMs are trained on 2D image-language tasks and lack the 3D spatial reasoning inherently required for embodied control in the 3D world. Bridging this gap directly with large-scale robotic data is costly and difficult to scale. Instead, we propose to enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding capabilities. Following this strategy, we train SPEAR-VLM, a 3D-aware VLM that infers object coordinates in 3D space from a single 2D image. Building on SPEAR-VLM, we introduce our main contribution, SPEAR-1: a robotic foundation model that integrates grounded 3D perception with language-instructed embodied control. Trained on $\sim$45M frames from 24 Open X-Embodiment datasets, SPEAR-1 outperforms or matches state-of-the-art models such as $\pi_0$-FAST and $\pi_{0.5}$, while using 20$\times$ fewer robot demonstrations. This carefully engineered training strategy unlocks new VLM capabilities and, as a consequence, boosts the reliability of embodied control beyond what is achievable with only robotic data. We make our model weights and 3D-annotated datasets publicly available.
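
The abstract says SPEAR-VLM infers object coordinates in 3D space, and the TL;DR mentions $N=1024$ 3D tokens. One plausible way to connect the two is to quantize each metric coordinate into one of 1024 bins and emit one token per bin. The sketch below illustrates that idea only: the coordinate range of $[-2, 2]$ metres, the uniform binning, and the `<3dNNNN>` token spelling are hypothetical choices for this example and are not taken from the paper.

```python
NUM_BINS = 1024  # N from the TL;DR; every other constant here is assumed

def coord_to_token(value, lo, hi, num_bins=NUM_BINS):
    """Map a coordinate in [lo, hi] to an integer bin index (values are clipped)."""
    clipped = min(max(value, lo), hi)
    frac = (clipped - lo) / (hi - lo)
    return min(int(frac * num_bins), num_bins - 1)

def token_to_coord(index, lo, hi, num_bins=NUM_BINS):
    """Return the centre of a bin, in the original units."""
    width = (hi - lo) / num_bins
    return lo + (index + 0.5) * width

def point_to_tokens(point, lo=-2.0, hi=2.0):
    """Spell a 3D point as three hypothetical vocabulary tokens."""
    return [f"<3d{coord_to_token(c, lo, hi):04d}>" for c in point]

print(point_to_tokens([0.0, -2.0, 2.0]))
```

With these assumed settings each bin is 4/1024 m wide, so decoding a token back to its bin centre loses at most about 2 mm per axis.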

Paper Structure

This paper contains 26 sections, 10 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: (a) Most VLAs fail to show zero-shot performance on the challenging Franka (DROID) setup in unseen environments without task- or environment-specific fine-tuning. (b) SPEAR-1 operates in this challenging setup, outperforms $\pi_0$-FAST [pertsch2025fast] and matches $\pi_{0.5}$ [intelligence2025pi_] on the Franka (DROID) embodiment zero-shot in unseen environments while using 20$\times$ less robot demonstration data. It also shows strong performance on WidowX (Bridge). (c) SPEAR-1 evaluation results on different embodiments and in different environments.
  • Figure 2: SPEAR-1 stages of training. Stage 0: General VLM pretraining on web-scale data, e.g. PaliGemma. Stage 1: Integrate a monocular depth vision encoder to build SPEAR-VLM and train it on embodied-inspired VQA tasks, e.g. 3D bounding box or object-to-object distance estimation. We use 2D images from non-robotic data, enriched with 3D annotations. Stage 2: Add an action expert to train SPEAR-1 on robot demonstration data, e.g. OpenX [open_x_embodiment_rt_x_2023]. Each stage boosts the model's robotics-relevant knowledge and capabilities, but the abundance and diversity of data decrease.
  • Figure 3: SPEAR-VLM overview. Left: Training data mixture, automatically computed spatial annotations, and example question-answer pairs from each category. Right: High-level architecture with fusion between the SigLIP and MoGe encoders and expansion of the PaliGemma embeddings with 3D tokens. This design equips SPEAR-VLM with explicit 3D understanding that serves as a strong foundation for SPEAR-VLA.
  • Figure 4: Real-world evaluation on WidowX. SPEAR-1 achieves 10% higher average task progress across all tasks than OpenVLA, a strong baseline in this setting. The bottom images show the real-world tasks whose performance is reported above.
  • Figure 5: Real world evaluation on Franka. We find that without any fine-tuning on the target environment, SPEAR-1 noticeably outperforms $\pi_0$-FAST, and matches $\pi_{0.5}$, even though both baselines are trained on 20$\times$ more robotic data from significantly more diverse environments. The bottom row shows challenging, varied Franka environments where SPEAR-1 maintains strong zero-shot performance.
  • ...and 1 more figure