KoopmanFlow: Spectrally Decoupled Generative Control Policy via Koopman Structural Bias

Chengsi Yao; Ge Wang; Kai Kang; Shenhao Yan; Jiahao Yang; Fan Feng; Honghao Cai; Xianxian Zeng; Rongjun Chen; Yiming Zhao; Yatong Han; Xi Li

Abstract

Generative Control Policies (GCPs) show strong promise in robotic manipulation but struggle to model stable global motions and high-frequency local corrections at the same time. While modern architectures extract multi-scale spatial features, their underlying Probability Flow ODEs apply a uniform temporal integration schedule. When compressed to a single step for real-time Receding Horizon Control (RHC), uniform ODE solvers smooth over sparse, high-frequency transients that are entangled with low-frequency steady states. To decouple these dynamics without accumulating pipelined errors, we introduce KoopmanFlow, a parameter-efficient generative policy guided by a Koopman-inspired structural inductive bias. Operating in a unified multimodal latent space with visual context, KoopmanFlow bifurcates generation at the terminal stage. Because visual conditioning occurs before spectral decomposition, both branches are visually guided yet temporally specialized. A macroscopic branch anchors slow-varying trajectories via single-step Consistency Training, while a transient branch uses Flow Matching to isolate high-frequency residuals triggered by sudden visual cues (e.g., contacts or occlusions). Guided by an explicit spectral prior and optimized via a novel asymmetric consistency objective, KoopmanFlow establishes a fused co-training mechanism that allows the transient (time-variant) branch to absorb localized dynamics without multi-stage error accumulation. Extensive experiments show that KoopmanFlow significantly outperforms state-of-the-art baselines in contact-rich tasks requiring agile disturbance rejection. By trading a surplus latency buffer for a richer structural prior, KoopmanFlow achieves superior control fidelity and parameter efficiency within real-time deployment limits.
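The idea of separating a slow-varying trajectory from sparse high-frequency residuals can be illustrated with a toy example. The sketch below is not the paper's method: it assumes a plain moving-average low-pass filter in place of the learned Koopman-inspired decomposition, and the names `moving_average`, `spectral_split`, the window size, and the synthetic action chunk are all hypothetical choices made for illustration.

```python
import math


def moving_average(signal, window):
    """Centered moving average with a shrinking window at the edges.

    Acts as a crude low-pass filter (an assumption for this sketch,
    standing in for a learned slow-dynamics branch).
    """
    half = window // 2
    n = len(signal)
    out = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def spectral_split(actions, window=9):
    """Split a 1-D action sequence into a slow part and a residual.

    By construction, slow[t] + residual[t] reproduces actions[t].
    """
    slow = moving_average(actions, window)
    residual = [a - s for a, s in zip(actions, slow)]
    return slow, residual


# Hypothetical 1-D action chunk: a slow reaching motion plus a brief,
# sparse correction (for example, at a contact event).
horizon = 64
actions = [math.sin(2 * math.pi * t / horizon) for t in range(horizon)]
for t in range(30, 33):
    actions[t] += 0.5  # short high-frequency transient

slow, residual = spectral_split(actions)

# The two components sum back to the original sequence.
recon_error = max(abs(a - (s + r)) for a, s, r in zip(actions, slow, residual))

# The residual is largest where the transient was injected.
peak_index = max(range(horizon), key=lambda t: abs(residual[t]))
```

In this toy setting the residual concentrates around the injected transient while the smoothed component carries the global motion, which is the kind of separation the two branches are described as specializing in. A fixed filter like this also leaves a noticeable residual at the sequence edges, one reason a hand-set filter is only a rough stand-in for a learned decomposition.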

Paper Structure (14 sections, 19 equations, 12 figures, 9 tables, 2 algorithms)

This paper contains 14 sections, 19 equations, 12 figures, 9 tables, 2 algorithms.

Introduction
Related Work
Generative Control Policies for Robotic Manipulation
Deep Koopman Dynamics and Spectral Decoupling
Methodology
Diffusion Transformer and Latent Spectral Decoupling
Action Dynamics via Koopman Operator Theory
Flow Matching and Fused Co-Training
Experiment
Experimental Setup
Main Results (Simulation)
Inference Efficiency & Ablation Studies
Real-World Experiments & Latent Visualization
Conclusion and Limitations

Figures (12)

Figure 1: Latent Space Topography of Multi-Dataset Action Features. We visualize the action manifolds across multiple manipulation tasks, revealing severe topological overlap of distinct kinematic frequencies. Time-variant critical states (colored dots, representing grasping or contact) are deeply intertwined with the continuous trajectories of the far more numerous time-invariant steady states (grey dots). This overlap highlights why standard continuous ODEs struggle: a single vector field cannot simultaneously resolve smooth inertia and high-frequency reactive corrections within the same topological neighborhood.
Figure 2: Architectural comparison of generative backbones. Left (DiT): The standard formulation using flattened observation conditioning. Middle (DiT-X [yan2025maniflow]): An architecture utilizing cascading cross-attention layers. Right (Ours): The proposed hierarchical condition injection, where the terminal Hybrid Koopman FFN decouples the velocity field into time-invariant ($\mathbf{v}_{inv}$) and time-variant ($\mathbf{v}_{var}$) dynamic components.
Figure 3: KoopmanFlow architecture. Multi-modal features are processed via DiT and spectrally decoupled by a terminal Hybrid Koopman FFN (HKFFN) into time-invariant ($\mathbf{v}_{inv}$) and time-variant ($\mathbf{v}_{var}$) velocity fields. Fused Co-Training ensures exact spectral decomposition and single-step generation (NFE = 1) for real-time control.
Figure 5: Internal Ablation Studies on the Handover Block task.
Figure 6: Quantitative success rate comparison between ManiFlow and KoopmanFlow across eight real-world manipulation tasks. For a fair comparison, both models use a frozen R3M (ResNet-18) visual encoder with a single-camera view and are trained on 65 episodes per task. Results are averaged over 30 real-world trials under varied environmental conditions using 1 and 10 inference steps. KoopmanFlow performs better, notably in long-horizon tasks requiring continuous TPU gripper coordination.
...and 7 more figures
