Habilis-$β$: A Fast-Motion and Long-Lasting On-Device Vision-Language-Action Model

Tommoro Robotics; Jesoon Kang; Taegeon Park; Jisu An; Soo Min Kimm; Jaejoon Kim; Jinu Pahk; Byungju Kim; Junseok Lee; Namheon Baek; Sungwan Ha; Hojun Baek; Eduardo Ayerve Cruz; Wontae Kim; Junghyeon Choi; Yousuk Lee; Joonmo Han; Sunghyun Cho; Sunghyun Kwon; Soyoung Lee; Jun Ki Lee; Seung-Joon Yi; Byoung-Tak Zhang; Theo Taeyeong Kim

TL;DR

The paper introduces the Productivity-Reliability Plane (PRP), which evaluates VLA deployment performance through Tasks per Hour (TPH) and Mean Time Between Intervention (MTBI) under a continuous-run protocol that demands both high-speed execution and sustained robustness.

Abstract

We introduce Habilis-$β$, a fast-motion and long-lasting on-device vision-language-action (VLA) model designed for real-world deployment. Current VLA evaluation remains largely confined to single-trial success rates under curated resets, which fails to capture the fast-motion and long-lasting capabilities essential for practical operation. To address this, we introduce the Productivity-Reliability Plane (PRP), which evaluates performance through Tasks per Hour (TPH) and Mean Time Between Intervention (MTBI) under a continuous-run protocol that demands both high-speed execution and sustained robustness. Habilis-$β$ combines language-free pre-training on large-scale play data, which provides robust interaction priors, with post-training on cyclic task demonstrations that capture state drift across consecutive task iterations. The system further employs ESPADA for phase-adaptive motion shaping to accelerate free-space transit, uses rectified-flow distillation to enable high-frequency control on edge devices, and incorporates classifier-free guidance (CFG) as a deployment-time knob to dynamically balance instruction adherence and learned interaction priors. In 1-hour continuous-run evaluations, Habilis-$β$ outperforms $π_{0.5}$ on the PRP metrics in both simulation and real-world environments. In simulation, Habilis-$β$ achieves 572.6 TPH and 39.2 s MTBI (vs. 120.5 TPH and 30.5 s for $π_{0.5}$), while in a real-world humanoid logistics workflow it achieves 124 TPH and 137.4 s MTBI (vs. 19 TPH and 46.1 s for $π_{0.5}$). Finally, Habilis-$β$ achieves the highest reported performance on the standard RoboTwin 2.0 leaderboard across representative tasks, validating its effectiveness in complex manipulation scenarios.
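To make the two PRP axes concrete, the following is a minimal sketch of how TPH and MTBI could be computed from a continuous-run log. The abstract names the metrics but does not give their formulas here, so the definitions used (completed tasks divided by run hours, and run time divided by intervention count) are assumptions, and `RunLog` and its fields are hypothetical names. The numbers are illustrative and are not the paper's results.

```python
from dataclasses import dataclass


@dataclass
class RunLog:
    """Hypothetical summary of one continuous run (not the paper's log format)."""
    duration_s: float      # wall-clock length of the run, in seconds
    tasks_completed: int   # task cycles completed during the run
    interventions: int     # human interventions during the run


def tasks_per_hour(log: RunLog) -> float:
    # Assumed definition: completed tasks normalized to one hour of run time.
    return log.tasks_completed / (log.duration_s / 3600.0)


def mean_time_between_intervention(log: RunLog) -> float:
    # Assumed definition: run time divided by the number of interventions.
    # With zero interventions the value is unbounded, reported here as infinity.
    if log.interventions == 0:
        return float("inf")
    return log.duration_s / log.interventions


# Illustrative numbers only: a 1-hour run with 120 tasks and 30 interventions.
log = RunLog(duration_s=3600.0, tasks_completed=120, interventions=30)
prp_point = (tasks_per_hour(log), mean_time_between_intervention(log))
print(prp_point)  # (120.0, 120.0)
```

Each run then maps to a single point (TPH, MTBI) on the plane, so two systems can be compared on speed and robustness at once rather than on single-trial success rate alone.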

Paper Structure

This paper contains 59 sections, 9 equations, 10 figures, 2 tables.

Introduction
Problem Setup and Metrics
Deployment Protocol: Continuous-Run Evaluation
Productivity Metric: Tasks per Hour (TPH)
Reliability Metric: Mean Time Between Intervention (MTBI)
Productivity-Reliability Plane
Habilis-$\beta$: Fast-Motion, Long-Lasting, On-Device VLA
System Overview
Model Architecture and Training
Flow Matching Action Expert
Rectified Flow Distillation
Classifier-Free Guidance (CFG)
High-Frequency Control
Data Strategy: From Play to Task
Data collection interfaces
...and 44 more sections

Figures (10)

Figure 1: Habilis-$\beta$ Overview. (Left) Training Pipeline: The model is trained in three stages to achieve Fast-Motion and Long-Lasting capabilities. Stage 1 learns a robust, task-agnostic interaction prior via play data pre-training. Stage 2 post-trains on cyclic task demonstrations, utilizing Spatially Aware Downsampling (ESPADA) to compress casual free-space motions. Stage 3 distills the multi-step flow matching action expert into an efficient rectified flow model. (Right) Inference Pipeline: To enable On-Device operation, a pre-trained VLM prefix fuses multimodal observations to condition the distilled action expert. The reduced inference cost is reinvested into High-Frequency Control, using shorter action chunks for rapid closed-loop reactivity. Finally, Classifier-Free Guidance (CFG) acts as a deployment-time knob to dynamically balance instruction adherence and learned interaction priors.
Figure 2: Model Architecture. Habilis-$\beta$'s prefix-suffix architecture uses a pre-trained VLM prefix to process multimodal inputs, conditioning a suffix action expert that generates continuous action chunks.
Figure 3: Simulation Task Setup. Simulation tasks used in our continuous-run benchmark are (top to bottom): Dump Bin Bigbin (DBB), Place Dual Shoes (PDS), and Stack Bowls Three (SBT). Detailed task procedures are described in the Simulation Task paragraph below.
Figure 4: Simulation Per-Task Results. We break down the continuous-run performance metrics (TPH, MTBI, and Success Rate) for each simulation task. Habilis-$\beta$ consistently outperforms baselines across all tasks, with ESPADA significantly boosting throughput (TPH) while maintaining high success rates.
Figure 5: Simulation Productivity-Reliability Plane. We plot the deployment performance of different methods in simulation. Habilis-$\beta$ (w/o ESPADA) achieves the highest reliability (MTBI), while the full Habilis-$\beta$ dramatically increases throughput (TPH), offering a configurable trade-off for different operational needs.
...and 5 more figures
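
The Figure 1 caption describes two inference-time mechanisms: a distilled rectified-flow action expert that needs few integration steps, and CFG as a deployment-time knob. The sketch below illustrates both on a toy problem, assuming the standard CFG blend `v_uncond + scale * (v_cond - v_uncond)` and plain Euler integration of the flow ODE. It is not the paper's implementation: `cfg_velocity`, `euler_sample`, and `make_toy_velocity` are hypothetical names, and the toy velocity field stands in for a learned action expert.

```python
def cfg_velocity(v_cond, v_uncond, scale):
    """Standard CFG blend: scale=0 follows the unconditional prior,
    scale=1 follows the instruction-conditioned prediction."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Toy stand-in for a learned expert: the velocity of a straight-line
    flow that carries any start point to `target` at t=1."""
    def velocity(x, t):
        return [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]
    return velocity


# Hypothetical action targets for the conditioned and unconditioned branches.
v_instr = make_toy_velocity([1.0, 2.0])
v_prior = make_toy_velocity([3.0, 0.0])


def guided(scale):
    return lambda x, t: cfg_velocity(v_instr(x, t), v_prior(x, t), scale)


start = [0.0, 0.0]
one_step = euler_sample(v_instr, start, num_steps=1)
four_step = euler_sample(v_instr, start, num_steps=4)
blended = euler_sample(guided(0.5), start, num_steps=4)
print(one_step, four_step, blended)
```

Because the toy flow is a straight line, one Euler step lands on the same endpoint as four, which is the property that rectified-flow distillation aims for in a learned model. Sweeping `scale` moves the sampled action between the prior target and the instruction target without retraining, which is the sense in which CFG acts as a knob.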
