Scaling Manipulation Learning with Visual Kinematic Chain Prediction

Xinyu Zhang, Yuhan Liu, Haonan Chang, Abdeslam Boularias

TL;DR

This work introduces a universal, visually grounded action representation for diverse robots by forecasting the visual kinematic chain in image space. The Visual Kinematics Transformer (VKT) is a convolution-free, attention-based model that predicts multi-view kinematic chains and is trained with a single objective using Earth-Moving Distance with Sinkhorn matching, eliminating manual action normalization. Empirically, VKT demonstrates strong performance as both a specialized and general agent across multiple robotics benchmarks and real-robot tasks, often outperforming BC-transformers and showing robust generalization through multi-environment training. The approach enables scalable, cross-robot manipulation learning by recasting actions as visually forecasted kinematic structures, with practical implications for composable, language-conditioned robotic policies.

Abstract

Learning general-purpose models from diverse datasets has achieved great success in machine learning. In robotics, however, existing methods in multi-task learning are typically constrained to a single robot and workspace, while recent work such as RT-X requires a non-trivial action normalization procedure to manually bridge the gap between different action spaces in diverse environments. In this paper, we propose the visual kinematic chain as a precise and universal representation of quasi-static actions for robot learning over diverse environments. It requires no manual adjustment, since visual kinematic chains can be obtained automatically from the robot's model and camera parameters. We propose the Visual Kinematics Transformer (VKT), a convolution-free architecture that supports an arbitrary number of camera viewpoints and is trained with a single objective of forecasting kinematic structures through optimal point-set matching. We demonstrate the superior performance of VKT over BC transformers as a general agent on Calvin, RLBench, Open-X, and real robot manipulation tasks. Video demonstrations can be found at https://mlzxy.github.io/visual-kinetic-chain.
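
The abstract states that visual kinematic chains can be obtained from the robot's model and camera parameters. The following is a minimal sketch of that idea, not the authors' implementation: it assumes a standard pinhole camera and assumes the 3D joint positions have already been computed (for example by forward kinematics on the robot model). The function names `sample_chain` and `project_points`, and the example intrinsics and joint positions, are hypothetical.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]
Pixel = Tuple[float, float]


def sample_chain(joints: Sequence[Point3], samples_per_link: int = 5) -> List[Point3]:
    """Densify a chain of 3D joint positions into points along each link."""
    points: List[Point3] = []
    for start, end in zip(joints, joints[1:]):
        for k in range(samples_per_link):
            s = k / samples_per_link  # interpolation factor in [0, 1)
            points.append(tuple(a + s * (b - a) for a, b in zip(start, end)))
    points.append(tuple(joints[-1]))  # include the final joint (end of the chain)
    return points


def project_points(points_world, K, R, t) -> List[Pixel]:
    """Project 3D world points to pixels with an assumed pinhole model.

    K is the 3x3 intrinsic matrix; R (3x3) and t (3,) map world to camera frame.
    Points at or behind the image plane (z <= 0) are dropped.
    """
    pixels: List[Pixel] = []
    for X in points_world:
        # Camera-frame coordinates: Xc = R @ X + t
        xc = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
        if xc[2] <= 0:
            continue
        u = K[0][0] * xc[0] / xc[2] + K[0][2]
        v = K[1][1] * xc[1] / xc[2] + K[1][2]
        pixels.append((u, v))
    return pixels


# Hypothetical example values: a 3-joint chain seen by a camera at the origin.
joints = [(0.0, 0.0, 1.0), (0.2, 0.0, 1.0), (0.2, 0.3, 1.5)]
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 40.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
chain_pixels = project_points(sample_chain(joints), K, R, t)
print(len(chain_pixels), chain_pixels[0])
```

Running the same projection once per camera would give one 2D point set per viewpoint, which is the kind of multi-view target the abstract describes.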

Paper Structure (12 sections, 1 equation, 12 figures, 7 tables)

Figures (12)

  • Figure 1: Overview of the proposed framework. We use the visual kinematic chain as a universal action representation across diverse robots and setups. We propose the Visual Kinematics Transformer (VKT), an architecture based solely on attention layers, which predicts the future movements of the visual kinematic chain in images from multiple viewpoints. Our VKT is trained with the earth-moving distance as the single learning objective without knowing any low-level robot states or actions. When deployed in a specific environment, we freeze the VKT as a backbone and attach a tiny head to project the VKT output to actual robot commands.
  • Figure 2: Rendering a kinematic chain as a set of pixels in an RGB image
  • Figure 3: Overview of our proposed Visual Kinematics Transformer (VKT). For each camera input, we encode the language instruction and RGB image as text and vision tokens with CLIP (Radford et al., 2021). Then, we concatenate the text tokens and the kinematics tokens, which are learned parameters, to form the query tokens. Next, the query and vision tokens are passed through a sequence of our proposed multi-view dual attention blocks. In each block, the query tokens are first updated with self-attention (orange). Then, cross-attention is applied with the query tokens as queries and the vision tokens as keys and values (green). A cross-attention layer updates queries with keys and values; we use ➝ to denote queries and $\multimapdot$ to denote keys and values. Then, both the query tokens and the vision tokens are updated through cross-attention with tokens from other camera viewpoints (blue). Next, the vision tokens are updated by cross-attention with the query tokens as keys and values (green). Finally, the $T$ kinematics tokens are projected into $T$ point sets through an MLP, representing the visual kinematic chain at the current step and the future $T-1$ steps. The predicted point sets are optimized through point-set matching with the ground truth, as shown in Equation 1.
  • Figure 4: Projecting the kinematics tokens into robot actions with 1D convolution.
  • Figure 5: Predicted visual kinematic chains returned by our VKT for different robots in Calvin (left), RLBench (right), and ALOHA (top right). The predicted kinematic chain of the current frame is colored in red and the forecast one for the next time-step is in blue.
  • ...and 7 more figures
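
The captions above describe training by matching predicted point sets to ground-truth point sets under the earth-moving distance. The following is a minimal sketch of an entropy-regularized (Sinkhorn) approximation of that distance between two 2D point sets, not the paper's implementation. The squared Euclidean cost, uniform point weights, coordinates roughly normalized to [0, 1], and the values of `eps` and `n_iters` are all assumptions, and the name `sinkhorn_emd` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point2 = Tuple[float, float]


def sinkhorn_emd(
    pred: Sequence[Point2],
    target: Sequence[Point2],
    eps: float = 0.05,
    n_iters: int = 200,
) -> Tuple[float, List[List[float]]]:
    """Approximate the earth mover's distance with Sinkhorn iterations.

    Returns the transport cost and the transport plan. Assumes coordinates
    are roughly normalized so that exp(-cost / eps) does not underflow.
    """
    n, m = len(pred), len(target)
    # Pairwise squared Euclidean cost between predicted and target points.
    cost = [[(px - tx) ** 2 + (py - ty) ** 2 for (tx, ty) in target]
            for (px, py) in pred]
    # Gibbs kernel of the entropy-regularized problem.
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a = [1.0 / n] * n  # uniform mass on predicted points (assumption)
    b = [1.0 / m] * m  # uniform mass on target points (assumption)
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return total, plan


# Hypothetical example: a predicted chain that is a slightly shifted copy
# of the ground-truth chain, in normalized image coordinates.
ground_truth = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
predicted = [(x + 0.1, y) for x, y in ground_truth]
loss, _ = sinkhorn_emd(predicted, ground_truth)
print(loss)
```

Because the cost depends only on the two point sets and not on their ordering, a loss of this form needs no fixed correspondence between predicted and ground-truth points. In a training setting the same iterations would typically be written in an autodiff framework so that gradients flow back to the predicted points.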