Improving Robustness to Out-of-Distribution States in Imitation Learning via Deep Koopman-Boosted Diffusion Policy

Dianye Huang, Nassir Navab, Zhongliang Jiang

TL;DR

This work tackles imitation learning from demonstrations when policies encounter out-of-distribution (OOD) states, which can cause poor generalization and over-reliance on proprioception. It introduces the Deep Koopman-boosted Dual-branch Diffusion Policy (D3P), a dual-branch diffusion framework that handles fused visual–proprioceptive input and purely visual input separately: the visual branch enables recovery, and the fused branch enables precise manipulation. A Deep Koopman Operator module learns structured visual dynamics in a latent space, regularizing the visual encoder and improving temporal coherence, while a test-time loss-based aggregation module selects and blends overlapping action chunks to improve reliability. Empirically, D3P improves on prior diffusion-policy baselines on RLBench simulation tasks (average gains around 14–15 percentage points) and on real-world robotic manipulation tasks, demonstrating enhanced robustness to OOD states and better recovery behavior. The results highlight the potential of combining dual-modal representations, structured latent dynamics, and uncertainty-guided aggregation to advance robust, long-horizon imitation learning for robotic manipulation.

Abstract

Integrating generative models with action chunking has shown significant promise in imitation learning for robotic manipulation. However, the existing diffusion-based paradigm often struggles to capture strong temporal dependencies across multiple steps, particularly when incorporating proprioceptive input. This limitation can lead to task failures, where the policy overfits to proprioceptive cues at the expense of capturing the visually derived features of the task. To overcome this challenge, we propose the Deep Koopman-boosted Dual-branch Diffusion Policy (D3P) algorithm. D3P introduces a dual-branch architecture to decouple the roles of different sensory modality combinations. The visual branch encodes the visual observations to indicate task progression, while the fused branch integrates both visual and proprioceptive inputs for precise manipulation. Within this architecture, when the robot fails to accomplish intermediate goals, such as grasping a drawer handle, the policy can dynamically switch to execute action chunks generated by the visual branch, allowing recovery to previously observed states and facilitating retrial of the task. To further enhance visual representation learning, we incorporate a Deep Koopman Operator module that captures structured temporal dynamics from visual inputs. During inference, we use the test-time loss of the generative model as a confidence signal to guide the aggregation of the temporally overlapping predicted action chunks, thereby enhancing the reliability of policy execution. In simulation experiments across six RLBench tabletop tasks, D3P outperforms the state-of-the-art diffusion policy by an average of 14.6%. On three real-world robotic manipulation tasks, it achieves a 15.0% improvement. Code: https://github.com/dianyeHuang/D3P.
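
The Koopman idea mentioned in the abstract can be illustrated with a small sketch: if latent visual features evolve approximately linearly, a single matrix K should map each latent state to the next one, and the one-step prediction error can serve as a training signal for the encoder. The code below is only a generic illustration under that assumption, not the paper's DKO module; the names `matvec` and `koopman_one_step_loss`, the 2-D latent space, and the toy trajectory are all hypothetical.

```python
def matvec(K, z):
    """Multiply a matrix K (list of rows) by a vector z (list of floats)."""
    return [sum(k_ij * z_j for k_ij, z_j in zip(row, z)) for row in K]


def koopman_one_step_loss(K, latents):
    """Mean squared error between K @ z_t and z_{t+1} along a latent trajectory.

    Assumption: `latents` stands in for encoder outputs at consecutive steps.
    """
    total, count = 0.0, 0
    for z_t, z_next in zip(latents[:-1], latents[1:]):
        pred = matvec(K, z_t)
        total += sum((p - q) ** 2 for p, q in zip(pred, z_next))
        count += len(z_next)
    return total / count


# Toy latent trajectory generated by an exactly linear rule (a 90-degree rotation).
K_true = [[0.0, -1.0], [1.0, 0.0]]
latents = [[1.0, 0.0]]
for _ in range(3):
    latents.append(matvec(K_true, latents[-1]))

identity = [[1.0, 0.0], [0.0, 1.0]]
loss_true = koopman_one_step_loss(K_true, latents)        # operator matches the dynamics
loss_identity = koopman_one_step_loss(identity, latents)  # operator ignores the dynamics
print(loss_true, loss_identity)
```

In this toy case the matching operator gives zero one-step error, while the identity operator does not. In a learned setting both the encoder and K would be optimized jointly, which is what makes such a loss act as a regularizer on the visual representation.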

Paper Structure

This paper contains 21 sections, 13 equations, 8 figures, 6 tables, 2 algorithms.

Figures (8)

  • Figure 1: An illustration of the proposed D3P algorithm, featuring a dual-branch architecture that generates two action chunks (ACs) at each inference step, and an aggregation module that synthesizes the final action sequence based on the test-time loss of the generative model.
  • Figure 2: Architecture and components of the Deep Koopman-boosted Dual-branch Diffusion Policy (D3P). (a) depicts the overall architecture of the proposed method. During training, the switching module randomly selects one of two inputs for the diffusion model: (i) $\mathbf{f}_u$, the latent action representation from (b) the Deep Koopman Operator (DKO) module, or (ii) $\mathbf{f}_f$, the fused representation of the visual and proprioceptive inputs. During inference, the diffusion model generates ACs conditioned on $\mathbf{f}_u$ and $\mathbf{f}_f$, respectively. The output ACs are then aggregated by (c) the ACs aggregation module (see the subsection on AC aggregation for more details).
  • Figure 3: An illustration of Diffusion Policy's performance on Open Drawer and Push Button tasks over multiple time steps when trained with (a) visual and proprioceptive data, and (b) visual data alone. For the Open Drawer task, as shown in (a), when the robotic arm fails to grasp the drawer handle, it gradually converges to and oscillates around a fixed joint configuration, indicating a lack of recovery behavior. In contrast, in (b), under the same failure condition, the robotic arm makes repeated attempts to open the drawer, demonstrating a capacity for recovery. For the Push Button task in (b), the policy trained solely on visual data fails to determine whether the button has been pressed, due to the ambiguity of visual feedback.
  • Figure 4: An example for computing the associated weights for each predicted action.
  • Figure 5: Examples of simulated tasks selected from RLBench, including: (a) opening a drawer, (b) pushing a button, (c) stacking a wine bottle, (d) sliding a block to a target zone, (e) sweeping trash into a dustpan, and (f) turning a water tap.
  • ...and 3 more figures
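
The loss-guided aggregation of overlapping action chunks (Figures 1 and 4) can be sketched as follows. This is a minimal illustration that assumes a softmax over negative test-time losses as the weighting rule; the paper's actual weight computation is not reproduced here. The function names, the `(start_step, actions, loss)` tuple layout, and the `temperature` parameter are hypothetical.

```python
import math


def loss_weights(losses, temperature=1.0):
    """Turn test-time losses into normalized weights; lower loss gets higher weight."""
    m = min(losses)  # subtract the minimum for numerical stability
    scores = [math.exp(-(loss - m) / temperature) for loss in losses]
    total = sum(scores)
    return [s / total for s in scores]


def aggregate_action(chunks, t, temperature=1.0):
    """Blend every prediction that overlapping chunks make for time step t.

    chunks: list of (start_step, actions, loss), where actions[i] is the
    action vector predicted for step start_step + i.
    """
    candidates = [
        (actions[t - start], loss)
        for start, actions, loss in chunks
        if 0 <= t - start < len(actions)
    ]
    if not candidates:
        raise ValueError(f"no chunk covers step {t}")
    weights = loss_weights([loss for _, loss in candidates], temperature)
    dim = len(candidates[0][0])
    return [
        sum(w * action[d] for w, (action, _) in zip(weights, candidates))
        for d in range(dim)
    ]


# Two overlapping chunks with equal loss: the blend at the shared step is their mean.
chunks = [
    (0, [[0.0], [0.0], [0.0]], 0.5),
    (1, [[1.0], [1.0], [1.0]], 0.5),
]
blended = aggregate_action(chunks, t=1)
only_first = aggregate_action(chunks, t=0)
only_second = aggregate_action(chunks, t=3)
print(blended, only_first, only_second)
```

A smaller `temperature` makes the blend approach a hard selection of the lowest-loss chunk, while a larger one approaches a plain average of the overlapping predictions.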