Enabling Dynamic Tracking in Vision-Language-Action Models via Time-Discrete and Time-Continuous Velocity Feedforward

Johannes Hechtl, Philipp Schmitt, Georg von Wichert, Wolfram Burgard

Abstract

While vision-language-action (VLA) models have shown great promise for robot manipulation, their deployment on rigid industrial robots remains challenging due to the inherent trade-off between compliance and responsiveness. Standard Behavior Cloning (BC) approaches predict discrete poses at low frequencies, omitting the velocity and acceleration feedforward terms typically used by low-level compliant controllers. Without these terms, the controller must rely on high stiffness for accurate tracking, thereby sacrificing safe contact dynamics. In this paper, we demonstrate the importance of integrating velocity feedforward terms into VLA policies to resolve this trade-off. We propose two methods for extracting velocity targets from VLAs: a time-discrete finite-difference approximation that serves as a highly effective bridge for existing models, and a continuous Cubic B-Spline action space that natively yields $C^2$ continuous trajectories for high-frequency control. Crucially, both approaches are strictly model-agnostic and compatible with any standard action-chunking architecture, requiring modifications only to teleoperation, data processing, and the low-level controller. We fine-tune the $\pi_{0.5}$ model and evaluate both of our approaches on a demanding, contact-rich cube-in-hole task. Our results indicate that incorporating the velocity feedforward term via finite differences significantly improves task execution speed, while the continuous B-Spline approach maintains high overall success rates and provides a foundation for smoother higher-order derivatives without compromising compliance.
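To make the time-discrete variant concrete, the following is a minimal sketch of how velocity targets might be derived from an action chunk by finite differences and then used as a feedforward term. It is not the authors' implementation: the function names, the constant step `dt`, the zero velocity at the end of the chunk, and the simple spring-damper law are all assumptions chosen for illustration.

```python
def finite_difference_velocities(positions, dt):
    """Return one velocity per predicted position (piecewise constant).

    positions: list of poses, each a list of floats (one per axis).
    dt: time between consecutive predictions in seconds (assumed constant).
    The last pose has no successor, so its velocity is set to zero
    (an assumption of this sketch: hold the final pose).
    """
    velocities = [
        [(b - a) / dt for a, b in zip(p0, p1)]
        for p0, p1 in zip(positions, positions[1:])
    ]
    velocities.append([0.0] * len(positions[-1]))
    return velocities


def reference_at(positions, velocities, dt, t):
    """Linearly interpolated position and constant velocity at time t >= 0."""
    k = min(int(t // dt), len(positions) - 1)  # index of the active segment
    tau = t - k * dt                           # time since that segment began
    x_ref = [p + v * tau for p, v in zip(positions[k], velocities[k])]
    return x_ref, velocities[k]


def impedance_command(x, v, x_ref, v_ref, stiffness, damping):
    """Hypothetical spring-damper law; v_ref is the velocity feedforward."""
    return [
        stiffness * (xr - xi) + damping * (vr - vi)
        for xi, vi, xr, vr in zip(x, v, x_ref, v_ref)
    ]
```

In this sketch, a robot that is exactly on the reference and moving at the reference velocity receives a zero command. If `v_ref` is replaced by zero, the damping term opposes the motion, which is one way to see why tracking without feedforward tends to need higher stiffness.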

Paper Structure (9 sections, 5 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Comparison of target trajectory representations (also called action chunks) for low-level controllers. Blue dots indicate time-discrete policy predictions. (a) Baseline: Standard VLA outputs yield stepwise position references without velocity feedforward. (b) Ours-1: Finite-difference approximation assumes linear interpolation between positions, providing piecewise-constant velocities. (c) Ours-2: Our proposed Cubic B-Spline action space generates $C^2$ continuous position and velocity profiles directly from model predictions.
  • Figure 2: Workflow illustrating Cubic B-Spline inference. The VLA outputs control points, which are continuously sampled by the high-frequency controller.
  • Figure 3: Experimental setup. Teleoperation is performed using two OMY-L100 devices.
  • Figure 4: Closeup of the cube-in-hole task. A strict $1\,\mathrm{mm}$ tolerance necessitates compliance.
  • Figure 5: Distribution of episode durations for teleoperation across different controller configurations. We include human performance for comparison.
  • ...and 1 more figure
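The B-Spline workflow described in the captions of Figures 1(c) and 2 can be illustrated with a short sketch in which a high-frequency controller samples position and velocity from a set of control points. The parameterization below (uniform knots, a fixed `segment_duration`, scalar control points for a single axis, clamping outside the spline) is an assumption made for illustration; the source does not specify the knot vector or timing used by the authors.

```python
def cubic_bspline_sample(control_points, segment_duration, t):
    """Evaluate a uniform cubic B-spline and its time derivative at time t.

    control_points: list of at least four floats for one axis (a stand-in
        for the control points a policy would output).
    segment_duration: seconds spanned by one spline segment (assumed constant).
    Times outside the spline are clamped to its ends.
    Returns (position, velocity).
    """
    n_segments = len(control_points) - 3
    s = min(max(t / segment_duration, 0.0), float(n_segments))
    i = min(int(s), n_segments - 1)  # active segment index
    u = s - i                        # local parameter in [0, 1]
    points = control_points[i:i + 4]

    # Uniform cubic B-spline basis functions.
    basis = [
        (1 - u) ** 3 / 6,
        (3 * u**3 - 6 * u**2 + 4) / 6,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6,
        u**3 / 6,
    ]
    # Derivatives of the basis functions with respect to u.
    dbasis = [
        -(1 - u) ** 2 / 2,
        (3 * u**2 - 4 * u) / 2,
        (-3 * u**2 + 2 * u + 1) / 2,
        u**2 / 2,
    ]
    position = sum(b * p for b, p in zip(basis, points))
    # Chain rule: du/dt = 1 / segment_duration.
    velocity = sum(d * p for d, p in zip(dbasis, points)) / segment_duration
    return position, velocity
```

Because position and velocity come from the same polynomial, the velocity reference in this sketch is continuous across segment boundaries, in contrast to the piecewise-constant velocities of the finite-difference variant. A controller loop would call the function once per control cycle with the elapsed time since the chunk started.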