Language-Driven Closed-Loop Grasping with Model-Predictive Trajectory Replanning

Huy Hoang Nguyen, Minh Nhat Vu, Florian Beck, Gerald Ebmer, Anh Nguyen, Andreas Kugi

TL;DR

This work tackles language-driven robotic grasping in dynamic scenes with a zero-shot, modular pipeline that combines open-vocabulary vision–language grounding, online 6D object pose localization, and receding-horizon trajectory optimization. It uses OWLv2 for language-conditioned 2D object detection, FoundationPose for pose estimation, and MP-TrajOpt for real-time planning under explicit dynamic constraints, and produces smooth, collision-free grasps of moving objects at real-time update rates. Key contributions include a three-stage pose localization pipeline with Kalman smoothing, a horizon-splitting MP-TrajOpt formulation with waypoint integration, and real-world validation reporting success rates of up to 92% and planning times under 100 ms. The framework integrates perception, language, and control in a modular way, with practical implications for human-robot collaboration and industrial manipulation, and is open-sourced for community use.
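
To make the "Kalman smoothing" step concrete, the following is a minimal sketch of a constant-velocity Kalman filter applied to one position axis of a tracked object. It is an illustration only: the motion model, the noise values `q` and `r`, and the names `ConstantVelocityKF` and `smooth` are assumptions and are not taken from the paper, whose filter may use a different state and tuning.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ConstantVelocityKF:
    """1D constant-velocity Kalman filter (hypothetical; one instance per axis)."""
    p: float = 0.0      # position estimate [m]
    v: float = 0.0      # velocity estimate [m/s]
    p00: float = 1.0    # entries of the symmetric 2x2 covariance matrix P
    p01: float = 0.0
    p11: float = 1.0
    q: float = 1e-2     # acceleration noise variance (assumed value)
    r: float = 1e-4     # measurement noise variance (assumed value)

    def predict(self, dt: float) -> None:
        # State transition F = [[1, dt], [0, 1]].
        self.p += self.v * dt
        # P <- F P F^T + Q, with Q from a white-noise acceleration model.
        p00 = self.p00 + 2.0 * dt * self.p01 + dt * dt * self.p11
        p01 = self.p01 + dt * self.p11
        self.p00 = p00 + self.q * dt ** 4 / 4.0
        self.p01 = p01 + self.q * dt ** 3 / 2.0
        self.p11 = self.p11 + self.q * dt ** 2

    def update(self, z: float) -> None:
        # Only the position is measured: H = [1, 0].
        s = self.p00 + self.r                  # innovation variance
        k0, k1 = self.p00 / s, self.p01 / s    # Kalman gain
        y = z - self.p                         # innovation
        self.p += k0 * y
        self.v += k1 * y
        # P <- (I - K H) P
        p00 = (1.0 - k0) * self.p00
        p01 = (1.0 - k0) * self.p01
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p11 = p00, p01, p11


def smooth(measurements: List[float], dt: float) -> List[Tuple[float, float]]:
    """Filter a sequence of position measurements; returns (position, velocity) pairs."""
    kf = ConstantVelocityKF()
    track = []
    for z in measurements:
        kf.predict(dt)
        kf.update(z)
        track.append((kf.p, kf.v))
    return track


if __name__ == "__main__":
    # Object moving at 0.1 m/s, measured at 30 Hz (the pose update rate in the abstract).
    dt = 1.0 / 30.0
    meas = [0.1 * (k + 1) * dt for k in range(90)]
    print(smooth(meas, dt)[-1])
```

A full 6D pose filter would also need to handle orientation, which this position-only sketch leaves out.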

Abstract

Integrating a vision module into a closed-loop control system to achieve seamless robot motion in a manipulation task is challenging because the modules involved run at inconsistent update rates. The task is even harder in a dynamic environment, e.g., when objects are moving. This paper presents a modular zero-shot framework for language-driven manipulation of (dynamic) objects through a closed-loop control system with real-time trajectory replanning and online 6D object pose localization. We segment an object within 0.5 s by leveraging a vision-language model driven by language commands. Then, guided by the natural language command, a closed-loop system comprising unified pose estimation and tracking and online trajectory planning continuously tracks the object and computes the optimal trajectory in real time. The proposed framework provides a smooth trajectory that avoids jerky movements and enables the robot to grasp a non-stationary object. Experimental results demonstrate the real-time capability of the framework to grasp moving objects accurately and efficiently, with update rates of up to 30 Hz for the online 6D pose localization module and 10 Hz for the receding-horizon trajectory optimization. These advantages highlight the modular framework's potential applications in robotics and human-robot interaction; see the video at https://www.acin.tuwien.ac.at/en/6e64/.
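
The inconsistent update rates mentioned above can be illustrated with a small scheduling sketch. It only counts how often each module would run in one second when pose localization runs at 30 Hz and trajectory optimization at 10 Hz, as stated in the abstract. The 1 kHz controller rate and the function name `simulate` are assumptions made for illustration; the source only describes the controller as running at a much higher frequency.

```python
from fractions import Fraction
from typing import Dict


def simulate(duration_s: int = 1, pose_hz: int = 30, plan_hz: int = 10,
             ctrl_hz: int = 1000) -> Dict[str, int]:
    """Count module executions on a shared clock ticking at the controller rate."""
    dt = Fraction(1, ctrl_hz)              # exact arithmetic avoids float drift
    pose_period = Fraction(1, pose_hz)
    plan_period = Fraction(1, plan_hz)
    next_pose = Fraction(0)
    next_plan = Fraction(0)
    have_pose = False
    counts = {"pose": 0, "plan": 0, "ctrl": 0}
    t = Fraction(0)
    for _ in range(duration_s * ctrl_hz):
        if t >= next_pose:
            # A new filtered object pose becomes available.
            have_pose = True
            counts["pose"] += 1
            next_pose += pose_period
        if t >= next_plan and have_pose:
            # The planner replans from the most recent pose it has seen.
            counts["plan"] += 1
            next_plan += plan_period
        # The controller tracks the latest planned trajectory on every tick.
        counts["ctrl"] += 1
        t += dt
    return counts


if __name__ == "__main__":
    print(simulate())
```

In this sketch each replanning step sees roughly three new pose estimates, and the controller executes about one hundred ticks per plan, which is why the planner output must be a smooth trajectory rather than a single setpoint.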

Paper Structure (15 sections, 15 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Overview of the proposed zero-shot framework: The top row shows the modules of the proposed framework, which run at different update rates, i.e., the language-driven object detection module (low frequency), the object pose localization module (medium frequency), the MP-TrajOpt module (medium frequency), and the controller (very high frequency). Note that the grey blocks form a closed control loop. The overlay images in the second row illustrate one run of the experimental evaluation.
  • Figure 2: Flow chart of the language-driven object detection and object pose localization module: The vision module OWLv2 [minderer2023scaling] detects the object's 2D location to create a binary mask. This mask is then used together with a CAD model for initial pose estimation [wen2023foundationpose], and the resulting pose is refined by a Kalman filter. The light orange block indicates the language-driven object detection module, while the light green block shows the three object pose localization stages.
  • Figure 3: Illustration of the pre-grasp waypoint located above the object's pose: The $x$-, $y$-, and $z$-axes are colored in red, green, and blue, respectively.
  • Figure 4: Set of ten objects used in the robotic experiment.
  • Figure 5: (a) 3D trajectory plot. (b) Error plot.
  • ...and 1 more figures
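
The pre-grasp waypoint described in the Figure 3 caption can be sketched as a fixed offset from the object's position. This is a hedged illustration, not the paper's definition: the function name `pregrasp_waypoint`, the 0.10 m default offset, and the choice between the world $z$-axis and the object's own $z$-axis are all assumptions.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def pregrasp_waypoint(obj_pos: Vec3,
                      obj_rot: Sequence[Sequence[float]],
                      offset: float = 0.10,
                      use_object_axis: bool = False) -> Vec3:
    """Return a point `offset` metres from the object along a chosen z-axis.

    obj_rot is the 3x3 rotation matrix of the object frame expressed in the
    world frame; its third column is the object's z-axis in world coordinates.
    """
    if use_object_axis:
        axis = (obj_rot[0][2], obj_rot[1][2], obj_rot[2][2])
    else:
        axis = (0.0, 0.0, 1.0)  # world z-axis, i.e. straight up (assumed)
    return tuple(p + offset * a for p, a in zip(obj_pos, axis))


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(pregrasp_waypoint((0.5, 0.0, 0.2), identity))
```

With a moving object, such a waypoint would have to be recomputed from each new pose estimate before it is handed to the planner.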