cVLA: Towards Efficient Camera-Space VLAs

Max Argus, Jelena Bratulic, Houman Masnavi, Maxim Velikanov, Nick Heppert, Abhinav Valada, Thomas Brox

TL;DR

This paper introduces cVLA, a lightweight Vision-Language-Action system that predicts two image-space end-effector keyposes in a single step and is trained on synthetic data, which keeps training efficient and supports broad sim-to-real transfer. It fine-tunes a PaliGemma2 backbone, incorporates depth into prompts, and explores inference-time decoding (beam-search-NMS) and one-shot imitation via demonstration-conditioned trajectories. The authors evaluate on ManiSkill3/Objaverse simulations, DROID real data, and a real Franka Panda setup, showing robust sim-to-real transfer without real-world fine-tuning. Key contributions include an efficient dataset and training pipeline, a depth-enabled prompt design, a novel decoding strategy, and a demonstration of one-shot imitation in both simulated and real environments. Overall, the work highlights the potential of sim-trained Vision-Language Models to accelerate VLA research with reduced data and compute requirements.

Abstract

Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive performance of Vision Language Models (VLMs) on 2D images to directly infer robot end-effector poses in image frame coordinates. Unlike prior VLA models that output low-level controls, our model predicts trajectory waypoints, making it both more efficient to train and robot embodiment agnostic. Despite its lightweight design, our next-token prediction architecture effectively learns meaningful and executable robot trajectories. We further explore the underutilized potential of incorporating depth images, inference-time techniques such as decoding strategies, and demonstration-conditioned action generation. Our model is trained on a simulated dataset and exhibits strong sim-to-real transfer capabilities. We evaluate our approach using a combination of simulated and real data, demonstrating its effectiveness on a real robotic system.
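To make the idea of predicting end-effector poses in image frame coordinates more concrete, the following is a minimal sketch of how a predicted pixel location plus a depth value could be lifted to a 3D point in the camera frame. It assumes a standard pinhole camera model; the `Intrinsics` class, the `backproject` and `project` helpers, and all numeric values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (illustrative values only)."""
    fx: float
    fy: float
    cx: float
    cy: float


def backproject(u: float, v: float, depth: float, k: Intrinsics) -> tuple:
    """Lift pixel (u, v) with depth along the optical axis to a camera-frame point."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    x = (u - k.cx) / k.fx * depth
    y = (v - k.cy) / k.fy * depth
    return (x, y, depth)


def project(point: tuple, k: Intrinsics) -> tuple:
    """Project a camera-frame point back to pixel coordinates (inverse of backproject)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (k.fx * x / z + k.cx, k.fy * y / z + k.cy)


# Hypothetical camera and a hypothetical predicted keypose location.
camera = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
grasp_point = backproject(u=380.0, v=240.0, depth=0.5, k=camera)
pixel_again = project(grasp_point, camera)
```

Because the prediction lives in the camera frame, a robot-specific step (the camera-to-base extrinsic transform, not shown here) would still be needed before execution; this separation is one way to read the abstract's claim of being robot embodiment agnostic.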

Paper Structure

This paper contains 24 sections, 2 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Overview of cVLA. Our lightweight method is based on fine-tuning a PaliGemma2 (Steiner et al., 2024) model for trajectory prediction using our curated dataset with a single image, robot state, and task description as inputs. Our synthetic training dataset is built from different simulations of pick-and-place tasks, which enables easy scaling and an efficient training pipeline. The approach shows good generalization across different application domains, including simulation, real data, and real robot setups, and offers a simpler setup for experimental research and development of VLAs.
  • Figure 2: Action representation ablation. Success rates on the CLEVR-easy simulation for action predictions in the robot and image coordinate frames; the camera frame performs better on average.
  • Figure 3: Cropping strategies comparison. On DROID-hard, cropping can consistently improve performance, but it also starts to induce failures.
  • Figure 4: Exemplary motivation for decoding. We qualitatively visualize results on episode 81 of the DROID-hard dataset. The most probable beam corresponds to the red cube, but our proposed NMS-based beam decoding strategy also detects the correct target object location (blue cup).
  • Figure 5: Real-world demonstration of our approach. The top row illustrates the task of placing a spatula onto a cutting board, while the bottom row depicts the robot placing a mango onto a plate.
  • ...and 7 more figures
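
The NMS-based beam decoding mentioned in the Figure 4 caption can be illustrated with a generic greedy non-maximum suppression over beam candidates. This is only a sketch under stated assumptions: the suppression criterion used in the paper is not described in this section, so the example assumes each beam decodes to a 2D image location with a log-probability score and suppresses lower-scoring beams within a pixel radius of a kept one. The function name `nms_beams`, the `radius` parameter, and the candidate values are hypothetical.

```python
import math


def nms_beams(candidates, radius):
    """Greedy NMS over beam candidates.

    candidates: list of (log_prob, (u, v)) pairs, one per decoded beam.
    radius: pixel distance within which a lower-scoring beam is suppressed
            (an assumed criterion, not taken from the paper).
    Returns the kept candidates in descending score order.
    """
    kept = []
    for score, pos in sorted(candidates, key=lambda c: c[0], reverse=True):
        # Keep a beam only if it is far from every beam kept so far.
        if all(math.dist(pos, kept_pos) > radius for _, kept_pos in kept):
            kept.append((score, pos))
    return kept


# Hypothetical beams: two clusters of near-duplicate predictions.
beams = [
    (-0.2, (100.0, 120.0)),
    (-0.3, (104.0, 118.0)),
    (-1.1, (300.0, 200.0)),
    (-1.5, (298.0, 205.0)),
]
distinct = nms_beams(beams, radius=20.0)
```

With these inputs, one candidate per cluster survives, so a less probable but spatially distinct target (as with the blue cup in Figure 4) remains available instead of being crowded out by near-duplicates of the top beam.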