EveryDayVLA: A Vision-Language-Action Model for Affordable Robotic Manipulation

Samarth Chopra; Alex McMoil; Ben Carnovale; Evan Sokolson; Rajkumar Kubendran; Samuel Dickerson

TL;DR

EveryDayVLA tackles the cost and brittleness barriers of robotic VLA systems by pairing a budget 6-DOF manipulator (roughly $300) with a unified vision-language-action model that jointly predicts discrete and continuous actions. AdaHorizon, an adaptive-horizon ensemble, monitors disagreement between the action heads and triggers replanning in real time, targeting safe, reliable manipulation in cluttered and unfamiliar scenes. The model builds on a Prismatic-7B VLM with SigLIP and DINOv2 encoders and a Llama 2 backbone, is trained with a combined cross-entropy and L1 objective, and is fine-tuned on a 1,200-demonstration dataset collected with the low-cost arm. On LIBERO, the model matches state-of-the-art success rates, and in real-world tests it outperforms prior methods by 49% in-distribution and 34.9% out-of-distribution, while maintaining high throughput and lowering the hardware barrier to adopting robotic foundation models.
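
As a rough illustration of what a combined cross-entropy and L1 objective can look like, the sketch below sums a cross-entropy term over discretized action bins and an L1 term over continuous action values. The function names, the per-dimension binning, and the `l1_weight` parameter are assumptions made for illustration; the summary above does not specify how the two terms are weighted or reduced.

```python
import math


def cross_entropy(logits, target_bin):
    """Cross-entropy of one categorical prediction (numerically stable log-sum-exp)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_bin]


def l1(pred, target):
    """Mean absolute error between two equal-length vectors."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def combined_loss(bin_logits, target_bins, cont_pred, cont_target, l1_weight=1.0):
    """Hypothetical combined objective: mean CE over action dims + weighted L1.

    bin_logits:  one list of logits per action dimension (discrete head)
    target_bins: ground-truth bin index per action dimension
    cont_pred / cont_target: continuous-head prediction and ground truth
    l1_weight:   assumed weighting factor, not taken from the paper
    """
    ce = sum(cross_entropy(l, b) for l, b in zip(bin_logits, target_bins)) / len(target_bins)
    return ce + l1_weight * l1(cont_pred, cont_target)
```

With uniform logits over four bins, the cross-entropy term equals log 4, so the total is log 4 plus the weighted L1 error.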

Abstract

While Vision-Language-Action (VLA) models map visual inputs and language instructions directly to robot actions, they often rely on costly hardware and struggle in novel or cluttered scenes. We introduce EverydayVLA, a 6-DOF manipulator that can be assembled for under $300, capable of modest payloads and workspace. A single unified model jointly outputs discrete and continuous actions, and our adaptive-horizon ensemble monitors motion uncertainty to trigger on-the-fly re-planning for safe, reliable operation. On LIBERO, EverydayVLA matches state-of-the-art success rates, and in real-world tests it outperforms prior methods by 49% in-distribution and 34.9% out-of-distribution. By combining a state-of-the-art VLA with cost-effective hardware, EverydayVLA democratizes access to a robotic foundation model and paves the way for economical use in homes and research labs alike. Experiment videos and details: https://everydayvla.github.io/
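
One way to picture the adaptive-horizon idea is the minimal sketch below: given an action chunk from the discrete head and one from the continuous head, it executes steps only while the two heads agree, and asks for a new plan once they diverge. The mean-absolute-difference metric, the `threshold` value, and the `min_steps` floor are all assumptions for illustration; the abstract says only that the ensemble monitors motion uncertainty to trigger re-planning.

```python
def adaptive_horizon(discrete_chunk, continuous_chunk, threshold=0.05, min_steps=1):
    """Return how many steps of a predicted action chunk to execute before replanning.

    discrete_chunk / continuous_chunk: lists of per-step action vectors, one from
    each action head, already expressed in the same (e.g. normalized) units.
    threshold and min_steps are hypothetical tuning parameters.
    """
    horizon = 0
    for d_action, c_action in zip(discrete_chunk, continuous_chunk):
        # Per-step disagreement: mean absolute difference between the two heads.
        disagreement = sum(abs(d - c) for d, c in zip(d_action, c_action)) / len(d_action)
        if disagreement > threshold:
            break  # heads diverge: stop here and replan from the new observation
        horizon += 1
    # Always make some progress, but never exceed the chunk length.
    return max(horizon, min(min_steps, len(discrete_chunk)))


discrete = [[0.10, 0.20], [0.12, 0.22], [0.50, 0.50]]
continuous = [[0.10, 0.21], [0.12, 0.22], [0.10, 0.50]]
steps_to_run = adaptive_horizon(discrete, continuous)
```

In this example the heads agree on the first two steps and diverge on the third, so only two steps would be executed before the policy is queried again.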
