KineVLA: Towards Kinematics-Aware Vision-Language-Action Models with Bi-Level Action Decomposition

Gaoge Han, Zhengqing Gao, Ziwen Li, Jiaxin Huang, Shaoli Huang, Fakhri Karray, Mingming Gong, Tongliang Liu

Abstract

In this paper, we introduce a novel kinematics-rich vision-language-action (VLA) task in which language commands densely encode diverse kinematic attributes (such as direction, trajectory, orientation, and relative displacement) at key moments from initiation through completion. Unlike existing action instructions, which capture kinematics only coarsely or partially, these commands support fine-grained and personalized manipulation. In this setting, task goals remain invariant while execution trajectories must adapt to instruction-level kinematic specifications. To address this challenge, we propose KineVLA, a vision-language-action framework that explicitly decouples goal-level invariance from kinematics-level variability through a bi-level action representation and bi-level reasoning tokens, which serve as explicit, supervised intermediate variables that align language and action. To support this task, we construct kinematics-aware VLA datasets spanning both simulation and real-world robotic platforms, featuring instruction-level kinematic variations and bi-level annotations. Extensive experiments on LIBERO and a Realman-75 robot demonstrate that KineVLA consistently outperforms strong VLA baselines on kinematics-sensitive benchmarks, achieving more precise, controllable, and generalizable manipulation behaviors.

Paper Structure (14 sections, 7 equations, 7 figures, 5 tables, 1 algorithm)

Figures (7)

  • Figure 1: Comparison of Vanilla VLAs vs. KineVLA. Vanilla VLAs (kim2025openvla; wang2025vq) accept coarse command inputs (e.g., "place the wine bottle on the drawer") and produce actions with a relatively fixed bottle-label orientation. In contrast, our KineVLA can process fine-grained kinematic commands (e.g., "control the wine bottle to face a specific orientation on the cabinet") and generate robot end-effector actions oriented front, back, left, or right.
  • Figure 2: A task example from our proposed Kinematics-Rich Datasets. In contrast to coarse task instructions that aim to complete a general goal, our approach captures diverse, fine-grained kinematic variations for an action instruction and their temporal evolution across multiple key action stages. These variations encompass (A) Object Part, (B) Action Constraint, and (C) Target Relations, with corresponding images of key actions displayed from left to right. Our KineVLA method is designed to address this challenge, excelling at perceiving these multi-faceted details to achieve precise manipulation.
  • Figure 3: Overview of our framework. Our approach decouples low-frequency, goal-level control from fine-grained kinematic refinements to effectively handle kinematics-rich tasks. (a) The proposed Bi-Level RVQ-VAE learns hierarchical action representations (Sec. \ref{subsec:bilevel-vqvae}), while (b) the KineVLA framework addresses kinematics-rich tasks through bi-level generation (Sec. \ref{subsec:cot-generation}).
  • Figure 4: Experimental Results and Comparisons. We benchmark our method across the three proposed kinematics-aware datasets, encompassing both simulation and real-world robotic experiments. The left two figures illustrate example environments, while the bar chart on the right presents the goal and kinematic success rates.
  • Figure 5: Task execution examples using KineVLA. From left to right, the figure shows the kinematics-rich instruction input, the initial state of the environment, followed by the bi-level reasoning text and action tokens generated by the KineVLA model. Next are two columns illustrating the key robot action states, and finally, the resulting final state after the robot executes the task.
  • ...and 2 more figures
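
The Figure 3 caption describes a Bi-Level RVQ-VAE that separates goal-level control from fine-grained kinematic refinements. As a rough illustration of the residual quantization idea only, the sketch below encodes a toy 2-D action with two codebooks: a coarse one, then a fine one applied to what the coarse code leaves over. The codebook values, function names, and the reading of "coarse" as goal-level and "fine" as kinematics-level are assumptions made for this example. The paper's model learns its codebooks inside a VAE, which this sketch does not attempt to reproduce.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared L2)."""
    def sq_dist(c: Vector) -> float:
        return sum((ci - xi) ** 2 for ci, xi in zip(c, x))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def encode_bilevel(x: Vector, coarse_cb: Sequence[Vector],
                   fine_cb: Sequence[Vector]) -> Tuple[int, int]:
    """Quantize x with the coarse codebook, then quantize the residual."""
    i = nearest(coarse_cb, x)
    residual = [xi - ci for xi, ci in zip(x, coarse_cb[i])]
    j = nearest(fine_cb, residual)
    return i, j


def decode_bilevel(i: int, j: int, coarse_cb: Sequence[Vector],
                   fine_cb: Sequence[Vector]) -> List[float]:
    """Reconstruct the action as coarse entry plus fine entry."""
    return [c + f for c, f in zip(coarse_cb[i], fine_cb[j])]


# Hypothetical hand-picked codebooks (a real RVQ-VAE would learn these).
COARSE = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # assumed goal-level codes
FINE = [(0.0, 0.0), (0.1, 0.0), (-0.1, 0.0),
        (0.0, 0.1), (0.0, -0.1)]               # assumed refinement codes

action = (1.08, 0.02)
codes = encode_bilevel(action, COARSE, FINE)
recon = decode_bilevel(codes[0], codes[1], COARSE, FINE)


def sq_error(a: Vector, b: Vector) -> float:
    """Squared reconstruction error between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


coarse_only_error = sq_error(action, COARSE[codes[0]])
bilevel_error = sq_error(action, recon)
```

In this toy case the second level reduces the reconstruction error relative to the coarse code alone. That is the usual motivation for residual quantization: the first index captures the bulk of the action and the second captures a small correction.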