KOROL: Learning Visualizable Object Feature with Koopman Operator Rollout for Manipulation

Hongyi Chen, Abulikemu Abuduweili, Aviral Agrawal, Yunhai Han, Harish Ravichandar, Changliu Liu, Jeffrey Ichnowski

TL;DR

KOROL tackles vision-based manipulation without ground-truth (GT) object states by learning visual object features from RGB-D images and coupling them with a Koopman operator rollout that predicts robot trajectories. The method jointly trains a feature extractor and a finite-dimensional Koopman operator, refitting the operator periodically to track changes in the features, and uses frequency-domain (DCT) augmentation to improve feature discrimination. On simulated and real-world tasks it reports better sample efficiency and performance than the baselines, outperforming NDP by 1.08$\times$ and Diffusion Policy by 1.16$\times$, and it extends Koopman dynamics to cross-task manipulation through a universal object feature interface. Class Activation Mapping (CAM) visualizations make the learned object features interpretable. The practical result is a more data-efficient, generalizable manipulation pipeline that runs without GT object states, which supports real-world deployment and multi-task transfer.

Abstract

Learning dexterous manipulation skills presents significant challenges due to complex nonlinear dynamics that underlie the interactions between objects and multi-fingered hands. Koopman operators have emerged as a robust method for modeling such nonlinear dynamics within a linear framework. However, current methods rely on runtime access to ground-truth (GT) object states, making them unsuitable for vision-based practical applications. Unlike image-to-action policies that implicitly learn visual features for control, we use a dynamics model, specifically the Koopman operator, to learn visually interpretable object features critical for robotic manipulation within a scene. We construct a Koopman operator using object features predicted by a feature extractor and utilize it to auto-regressively advance system states. We train the feature extractor to embed scene information into object features, thereby enabling the accurate propagation of robot trajectories. We evaluate our approach on simulated and real-world robot tasks, with results showing that it outperformed the model-based imitation learning NDP by 1.08$\times$ and the image-to-action Diffusion Policy by 1.16$\times$. The results suggest that our method maintains task success rates with learned features and extends applicability to real-world manipulation without GT object states. Project video and code are available at: \url{https://github.com/hychen-naza/KOROL}.
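
The two Koopman steps named in the abstract, constructing an operator from states and then advancing states auto-regressively, can be sketched on a toy system. This is a minimal illustration and not the authors' implementation: the lifting function `lift`, the least-squares fit via a pseudo-inverse, the helper names, and the linear toy dynamics are all assumptions. In KOROL the trailing state entries would be object features predicted by the feature extractor from images, whereas here they are simply simulated.

```python
import numpy as np

def lift(x):
    """Hypothetical lifting function: the state plus its element-wise squares."""
    return np.concatenate([x, x**2])

def fit_koopman(trajs):
    """Least-squares fit of K so that lift(x[t+1]) ≈ K @ lift(x[t])."""
    G = np.stack([lift(x) for traj in trajs for x in traj[:-1]], axis=1)
    G_next = np.stack([lift(x) for traj in trajs for x in traj[1:]], axis=1)
    return G_next @ np.linalg.pinv(G)

def rollout(K, x0, steps, n_robot):
    """Advance the lifted state with K and read out the robot part at each step."""
    z = lift(x0)
    robot_states = []
    for _ in range(steps):
        z = K @ z                       # linear step in the lifted space
        robot_states.append(z[:n_robot])  # leading entries are the robot state
    return np.array(robot_states)

# Toy data (assumption): 2 "robot" dims + 1 "object feature" dim, decaying linearly.
A = np.diag([0.9, 0.8, 0.7])

def simulate(x0, steps):
    xs = [x0]
    for _ in range(steps):
        xs.append(A @ xs[-1])
    return np.array(xs)

rng = np.random.default_rng(0)
demos = [simulate(rng.uniform(0.5, 1.5, size=3), 9) for _ in range(5)]

K = fit_koopman(demos)
x0 = np.array([1.0, 1.2, 0.8])
pred = rollout(K, x0, steps=5, n_robot=2)
true = simulate(x0, 5)[1:, :2]
```

On this toy system the lifted dynamics are exactly linear, so `pred` should match `true` up to numerical error. With real manipulation data the fit is only approximate, and its quality depends on the choice of lifting function and on the learned features.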

Paper Structure (39 sections, 7 equations, 8 figures, 8 tables, 1 algorithm)

Figures (8)

  • Figure 1: Left: Vanilla Koopman operators rely on ground-truth states, which may be difficult to obtain in real-world settings. Right: In contrast, we propose KOROL, which learns a dynamics model and task-relevant object features without object-state labels. The visualization shows the learned features localized around the door handle.
  • Figure 2: Training and Execution Pipeline. During training, KOROL updates the feature extractor $f_\theta$ based on the loss between the predicted robot trajectory $\hat{\tau}_r = [\hat{\mathrm{x}}_r(1), \hat{\mathrm{x}}_r(2), \cdots, \hat{\mathrm{x}}_r(T)]$ obtained through Koopman operator rollouts and the ground-truth robot trajectory $\tau_r = [\mathrm{x}_r(1), \mathrm{x}_r(2), \cdots, \mathrm{x}_r(T)]$. KOROL updates the Koopman operator with the new object features $\hat{\mathrm{x}}_o(t)$ every $M$ epochs to enhance the training of $f_\theta$. During execution, KOROL feeds the generated trajectory to the inverse dynamics controller to produce the actions.
  • Figure 3: Visualization of Object Features Using Class Activation Mapping (CAM) \cite{zhou2016learning}. Rows, from top to bottom, show the door opening, tool use, relocation, and reorientation tasks; columns, from left to right, show the execution of each task.
  • Figure 4: Training and Validation Loss Curves in Door Task. The dashed lines mark when $\mathbf{K}$ is updated.
  • Figure 5: Visualization of Object Features Using CAM in Three Real-World Tasks. From top to bottom, the sequence showcases training images from various trials of toy relocation, teapot pickup, and cube insertion tasks, demonstrating the feature extractor's generalization to positional variance.
  • ...and 3 more figures