EveryDayVLA: A Vision-Language-Action Model for Affordable Robotic Manipulation

Samarth Chopra, Alex McMoil, Ben Carnovale, Evan Sokolson, Rajkumar Kubendran, Samuel Dickerson

TL;DR

EveryDayVLA tackles the cost and brittleness barriers of robotic VLA systems by pairing a budget (roughly $300) 6-DOF manipulator with a unified vision–language–action model that jointly predicts discrete and continuous actions. A novel adaptive-horizon ensemble, AdaHorizon, monitors disagreement between the two action heads and triggers replanning in real time, enabling safe, reliable manipulation in clutter and in the wild. The approach uses a Prismatic-7B VLM with SigLIP and DINOv2 encoders and a Llama 2 backbone, trained with a combined cross-entropy and L1 objective and fine-tuned on a 1,200-demonstration dataset collected with the low-cost arm. Across LIBERO simulation and real-world tests, EveryDayVLA matches or exceeds state-of-the-art baselines, delivering gains of up to 49% in-distribution and 34.9% out-of-distribution, while achieving high throughput and lowering the hardware barrier to broader adoption of robotic foundation models.
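
As a rough illustration of what a combined cross-entropy and L1 objective can look like, the sketch below sums a cross-entropy term over discretized action bins and an L1 term over continuous action values. The function names, the `l1_weight` parameter, and the equal default weighting are assumptions made for this example; the summary above does not specify how the two terms are balanced.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one categorical prediction over action bins."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def l1(pred, target):
    """Mean absolute error between predicted and target continuous actions."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def combined_loss(bin_logits, bin_targets, cont_pred, cont_target, l1_weight=1.0):
    """Hypothetical combined objective: mean CE over discrete tokens + weighted L1.

    l1_weight is an assumed hyperparameter, not a value taken from the paper.
    """
    ce = sum(cross_entropy(l, t) for l, t in zip(bin_logits, bin_targets))
    ce /= len(bin_targets)
    return ce + l1_weight * l1(cont_pred, cont_target)
```

With uniform logits over four bins and a perfect continuous prediction, the loss reduces to the cross-entropy term alone, `log(4)`.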

Abstract

While Vision-Language-Action (VLA) models map visual inputs and language instructions directly to robot actions, they often rely on costly hardware and struggle in novel or cluttered scenes. We introduce EverydayVLA, a 6-DOF manipulator that can be assembled for under $300, capable of modest payloads and workspace. A single unified model jointly outputs discrete and continuous actions, and our adaptive-horizon ensemble monitors motion uncertainty to trigger on-the-fly re-planning for safe, reliable operation. On LIBERO, EverydayVLA matches state-of-the-art success rates, and in real-world tests it outperforms prior methods by 49% in-distribution and 34.9% out-of-distribution. By combining a state-of-the-art VLA with cost-effective hardware, EverydayVLA democratizes access to a robotic foundation model and paves the way for economical use in homes and research labs alike. Experiment videos and details: https://everydayvla.github.io/

Paper Structure

This paper contains 17 sections, 5 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: EveryDayVLA system. Top: EveryDayVLA fine-tunes a VLA for a low-cost manipulator to generate continuous and discrete actions, which are passed to an adaptive-horizon ensembler that produces adaptively sized action chunks, accounting for model uncertainty. Bottom: Our model achieves high in-distribution and out-of-distribution performance on real-world tasks, and our action ensembler (AdaHorizon) beats other state-of-the-art action ensemblers.
  • Figure 2: EveryDayVLA architecture. The VLA takes an image and a natural-language instruction as input; these are tokenized by the vision and language encoders and sent to the Llama 2 LLM, which produces continuous and discrete actions. These actions are then passed to the adaptive-horizon ensembler, which computes the difference between the two predictions and executes only those actions whose difference falls below a threshold.
  • Figure 3: EveryDayVLA hardware. The robot consists of 7 joints, including a base and a claw gripper as the end-effector. In total, the hardware costs $311.98 and provides 6 DOF, a payload of 0.2 kg, a reach of 382 mm, a maximum speed of 0.7 m/s, and repeatability within 10 mm.
  • Figure 4: Real-world evaluation results on in-distribution tasks, including picking up a block, a ball, and a rock. Our model beats state-of-the-art models on tasks and environments present in the training set by 49% on average. We evaluate on three different objects with three instructions each: "pick and place {away, left, right}". Our experiments show better success rates on almost every task.
  • Figure 5: Static and dynamic distractors. Top: We benchmark our model with static distractors in a cluttered scene, adding different objects and varying their arrangement after every trial. Bottom: We benchmark with dynamic distractors, where a human walks into the scene and moves around to distract the model.
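
The thresholding idea described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' AdaHorizon algorithm: it assumes both heads emit action chunks of equal length, measures disagreement as the per-step mean absolute difference, and executes the leading run of steps that stay below the threshold before replanning. The function name and the `min_steps` floor (which guarantees some progress) are hypothetical.

```python
def adaptive_horizon(cont_chunk, disc_chunk, threshold, min_steps=1):
    """Return how many leading actions of a chunk to execute before replanning.

    cont_chunk, disc_chunk: lists of action vectors from the continuous and
    (decoded) discrete heads. threshold and min_steps are assumed parameters.
    """
    horizon = 0
    for cont, disc in zip(cont_chunk, disc_chunk):
        # Per-step disagreement: mean absolute difference across action dims.
        disagreement = sum(abs(c - d) for c, d in zip(cont, disc)) / len(cont)
        if disagreement > threshold:
            break  # heads diverge here: stop and replan from this point
        horizon += 1
    # Always execute at least min_steps (capped at the chunk length).
    return max(horizon, min(min_steps, len(cont_chunk)))


cont = [[0.0, 0.0], [0.1, 0.1], [0.5, 0.5]]
disc = [[0.0, 0.0], [0.1, 0.12], [0.1, 0.1]]
steps = adaptive_horizon(cont, disc, threshold=0.05)
```

In this toy example the two heads agree on the first two steps and diverge on the third, so `steps` is 2 and the policy would be queried again after executing those two actions.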