Improving Robustness to Out-of-Distribution States in Imitation Learning via Deep Koopman-Boosted Diffusion Policy
Dianye Huang, Nassir Navab, Zhongliang Jiang
TL;DR
This work addresses imitation learning from demonstrations when policies encounter out-of-distribution (OOD) states, which can cause poor generalization and over-reliance on proprioception. It introduces the Deep Koopman-boosted Dual-branch Diffusion Policy (D3P), a dual-branch diffusion framework that handles fused visual-proprioceptive input and purely visual input separately: the visual branch drives recovery, and the fused branch drives precise manipulation. A Deep Koopman Operator module learns structured visual dynamics in a latent space, regularizing the visual encoder and improving temporal coherence, while a test-time loss-based aggregation module selects and blends overlapping action chunks to improve reliability. Empirically, D3P outperforms the state-of-the-art diffusion policy by an average of 14.6% on six RLBench simulation tasks and by 15.0% on three real-world robotic manipulation tasks, showing greater robustness to OOD states and better recovery behavior. The results highlight the potential of combining dual-modal representations, structured latent dynamics, and confidence-guided aggregation for robust, long-horizon imitation learning in robotic manipulation.
Abstract
Integrating generative models with action chunking has shown significant promise in imitation learning for robotic manipulation. However, the existing diffusion-based paradigm often struggles to capture strong temporal dependencies across multiple steps, particularly when incorporating proprioceptive input. This limitation can lead to task failures, where the policy overfits to proprioceptive cues at the expense of capturing the visually derived features of the task. To overcome this challenge, we propose the Deep Koopman-boosted Dual-branch Diffusion Policy (D3P) algorithm. D3P introduces a dual-branch architecture to decouple the roles of different sensory modality combinations. The visual branch encodes the visual observations to indicate task progression, while the fused branch integrates both visual and proprioceptive inputs for precise manipulation. Within this architecture, when the robot fails to accomplish intermediate goals, such as grasping a drawer handle, the policy can dynamically switch to execute action chunks generated by the visual branch, allowing recovery to previously observed states and facilitating retrial of the task. To further enhance visual representation learning, we incorporate a Deep Koopman Operator module that captures structured temporal dynamics from visual inputs. During inference, we use the test-time loss of the generative model as a confidence signal to guide the aggregation of the temporally overlapping predicted action chunks, thereby enhancing the reliability of policy execution. In simulation experiments across six RLBench tabletop tasks, D3P outperforms the state-of-the-art diffusion policy by an average of 14.6%. On three real-world robotic manipulation tasks, it achieves a 15.0% improvement. Code: https://github.com/dianyeHuang/D3P.
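The idea of using a test-time loss as a confidence signal when aggregating overlapping action chunks can be illustrated with a small sketch. This is not the D3P implementation: the abstract does not specify the aggregation rule, so the softmax weighting over negative losses, the `temperature` parameter, and the function name `aggregate_overlapping_chunks` are all assumptions made for illustration.

```python
import math


def aggregate_overlapping_chunks(chunks, losses, start_steps, t, temperature=1.0):
    """Blend the actions that overlapping chunks predict for timestep t.

    chunks:      list of action chunks; each chunk is a list of action vectors
    losses:      one test-time loss per chunk (lower is treated as more confident)
    start_steps: timestep at which each chunk's first action applies
    temperature: hypothetical knob; smaller values favor the lowest-loss chunk

    Assumption: confidence weights are a softmax over negative losses.
    """
    # Collect every chunk's prediction for timestep t, if the chunk covers it.
    candidates = []
    for chunk, loss, start in zip(chunks, losses, start_steps):
        offset = t - start
        if 0 <= offset < len(chunk):
            candidates.append((chunk[offset], loss))
    if not candidates:
        raise ValueError(f"no chunk covers timestep {t}")

    # Shift by the minimum loss so the exponentials stay numerically stable.
    min_loss = min(loss for _, loss in candidates)
    weights = [math.exp(-(loss - min_loss) / temperature) for _, loss in candidates]
    total = sum(weights)

    # Weighted average of the candidate actions, one dimension at a time.
    dim = len(candidates[0][0])
    return [
        sum(w * action[i] for w, (action, _) in zip(weights, candidates)) / total
        for i in range(dim)
    ]


# Two chunks of 1-D actions; the second chunk starts one step later.
chunks = [[[0.0], [1.0], [2.0]], [[3.0], [5.0], [7.0]]]
blended = aggregate_overlapping_chunks(chunks, losses=[0.2, 0.2], start_steps=[0, 1], t=1)
```

With equal losses the two overlapping predictions are averaged; as one chunk's loss grows relative to the other, its contribution shrinks toward zero, which approximates selecting the more confident chunk.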
