Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation

Qingwen Bu, Jia Zeng, Li Chen, Yanchao Yang, Guyue Zhou, Junchi Yan, Ping Luo, Heming Cui, Yi Ma, Hongyang Li

Abstract

Despite significant progress in robotics and embodied AI in recent years, deploying robots for long-horizon tasks remains a great challenge. The majority of prior work adheres to an open-loop philosophy and lacks real-time feedback, leading to error accumulation and poor robustness. A handful of approaches have tried to establish feedback mechanisms using pixel-level differences or pre-trained visual representations, yet their efficacy and adaptability have been found to be limited. Inspired by classic closed-loop control systems, we propose CLOVER, a closed-loop visuomotor control framework that incorporates feedback mechanisms to improve adaptive robotic control. CLOVER consists of a text-conditioned video diffusion model for generating visual plans as reference inputs, a measurable embedding space for accurate error quantification, and a feedback-driven controller that refines actions from feedback and initiates replans as needed. Our framework shows notable improvements in real-world robotic tasks and achieves state-of-the-art results on the CALVIN benchmark, improving by 8% over previous open-loop counterparts. Code and checkpoints are maintained at https://github.com/OpenDriveLab/CLOVER.
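
The three components named in the abstract (a planner that supplies sub-goals as reference inputs, an error measure, and a feedback-driven controller that replans when needed) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the CLOVER implementation: `closed_loop_rollout`, `toy_plan`, `toy_step`, the Euclidean `error`, and the `patience`-based replanning rule are all hypothetical stand-ins for the video diffusion planner, the learned embedding distance, and the learned policy.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def error(state: Vector, goal: Vector) -> float:
    """Euclidean distance; a stand-in for an embedding-space error (assumption)."""
    return sum((g - s) ** 2 for s, g in zip(state, goal)) ** 0.5


def closed_loop_rollout(
    state: Vector,
    plan_fn: Callable[[Vector], List[Vector]],
    step_fn: Callable[[Vector, Vector], Vector],
    reach_tol: float = 0.05,
    patience: int = 5,
    max_steps: int = 200,
) -> Tuple[Vector, int, bool]:
    """Track sub-goals in order; replan if the error stops shrinking (assumed rule)."""
    sub_goals = plan_fn(state)
    idx, replans, stalled, best = 0, 0, 0, float("inf")
    for _ in range(max_steps):
        if idx >= len(sub_goals):
            return state, replans, True          # every sub-goal was reached
        goal = sub_goals[idx]
        err = error(state, goal)
        if err < reach_tol:                      # sub-goal achieved: move to the next
            idx, stalled, best = idx + 1, 0, float("inf")
            continue
        if err < best - 1e-9:                    # still making progress
            best, stalled = err, 0
        else:
            stalled += 1
        if stalled >= patience:                  # treat the sub-goal as infeasible
            sub_goals, idx = plan_fn(state), 0
            replans, stalled, best = replans + 1, 0, float("inf")
            continue
        state = step_fn(state, goal)             # feedback-driven action
    return state, replans, idx >= len(sub_goals)


def toy_plan(state: Vector) -> List[Vector]:
    """Hypothetical planner: three evenly spaced waypoints toward a fixed target."""
    target = [1.0, 1.0]
    return [[s + (t - s) * k / 3 for s, t in zip(state, target)] for k in (1, 2, 3)]


def toy_step(state: Vector, goal: Vector) -> Vector:
    """Hypothetical controller: close half of the remaining error each step."""
    return [s + 0.5 * (g - s) for s, g in zip(state, goal)]


final_state, num_replans, done = closed_loop_rollout([0.0, 0.0], toy_plan, toy_step)
```

In this toy setting the controller always makes progress, so no replanning is triggered; swapping in a `step_fn` that leaves the state unchanged exercises the replanning branch instead.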

Paper Structure

This paper contains 20 sections, 4 equations, 17 figures, 6 tables, and 1 algorithm.

Figures (17)

  • Figure 1: Motivation. The proposed CLOVER is inspired by classic closed-loop control in automation systems (a). Our framework (b) employs a visual planner to predetermine a sequence of sub-goals (\ref{sec:planner}). These goals then guide the policy to generate actions with an error measurement strategy (\ref{sec:executor}). Within the feedback loop, it automatically replans when a sub-goal is infeasible and advances to the next one upon achievement (\ref{sec:feedback}).
  • Figure 2: Architecture of our feedback-driven policy. 1) The state encoder takes in both the current observation and the synthesized sub-goal. A shared multimodal encoder generates fused RGB-D features, and two queries then extract informative features as the current and goal embeddings, respectively. 2) The discrepancy between the two state embeddings is explicitly modeled as the error. 3) The resulting residual from error measurement is finally decoded into the action.
  • Figure 3: Comparison of the measurement ability of different embeddings. We visualize the cosine distance between embeddings of observations and generated sub-goals during a roll-out. (a) CLIP features (Radford et al., 2021) and (b) state embeddings trained without error measuring do not show clear interrelations among frames, whereas (c) state embeddings obtained from our policy are distributed reasonably in the latent space, which benefits error measurement in feedback loops.
  • Figure 4: Real-world robot setting. We propose a long-horizon task comprising three consecutive sub-tasks, where failure of an earlier sub-task inevitably leads to failure of the subsequent ones. The additional single tasks are designed to validate the generalizability of CLOVER in various aspects.
  • Figure 5: Experiment setting of the generalization evaluation. We place entirely new objects, absent from training, alongside the interaction object to introduce visual distraction. We also test policies under dynamic conditions by randomly placing and picking up a doll to create unpredictable visual changes.
  • ...and 12 more figures
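
The quantity plotted in Figure 3, the cosine distance between each observation embedding and a sub-goal embedding over a roll-out, can be computed as in the sketch below. The 2-D vectors are made-up placeholders rather than embeddings from the paper, and `cosine_distance` is a hypothetical helper written for this illustration.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Placeholder embeddings (assumption): a roll-out that rotates toward the goal.
goal_embedding: List[float] = [0.0, 1.0]
rollout_embeddings: List[List[float]] = [
    [1.0, 0.0],
    [1.0, 0.5],
    [0.5, 1.0],
    [0.0, 1.0],
]

# One distance per frame; a well-behaved embedding space should give a
# curve that shrinks as the observation approaches the sub-goal.
distances = [cosine_distance(obs, goal_embedding) for obs in rollout_embeddings]
is_monotonic = all(d0 > d1 for d0, d1 in zip(distances, distances[1:]))
```

A distance curve that decreases steadily toward zero, as these placeholder vectors produce, is the behavior the caption attributes to panel (c); embeddings without that property would give a curve with no clear trend.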