Language-Driven Closed-Loop Grasping with Model-Predictive Trajectory Replanning
Huy Hoang Nguyen, Minh Nhat Vu, Florian Beck, Gerald Ebmer, Anh Nguyen, Andreas Kugi
TL;DR
This work tackles language-driven robotic grasping in dynamic scenes by proposing a zero-shot, modular pipeline that unifies open-vocabulary vision–language grounding, online 6D object pose localization, and receding-horizon trajectory optimization. By leveraging OWLv2 for language-conditioned 2D object grounding, FoundationPose for pose estimation, and MP-TrajOpt for real-time planning with explicit dynamic constraints, the approach achieves smooth, collision-free grasps of moving objects at real-time update rates. Key contributions include a three-stage pose localization pipeline with Kalman smoothing, a horizon-splitting MP-TrajOpt formulation with waypoint integration, and extensive real-world validation showing success rates of up to 92% and planning times under 100 ms. The framework demonstrates robust, modular integration of perception, language, and control, with practical implications for human-robot collaboration and industrial manipulation tasks, and is open-sourced for community use.
Abstract
Integrating a vision module into a closed-loop control system to achieve \emph{seamless movement} of a robot in a manipulation task is challenging because the modules involved run at inconsistent update rates. The task is even more difficult in a dynamic environment, e.g., when objects are moving. This paper presents a \emph{modular} zero-shot framework for language-driven manipulation of (dynamic) objects through a closed-loop control system with real-time trajectory replanning and online 6D object pose localization. Given a language command, we segment the target object within \SI{0.5}{\second} by leveraging a vision-language model. A closed-loop system, comprising unified pose estimation and tracking and online trajectory planning, then continuously tracks this object and computes the optimal trajectory in real time. Our proposed zero-shot framework provides a smooth trajectory that avoids jerky movements and ensures the robot can grasp a non-stationary object. Experimental results demonstrate the real-time capability of the proposed framework to accurately and efficiently grasp moving objects, i.e., update rates of up to \SI{30}{\hertz} for the online 6D pose localization module and \SI{10}{\hertz} for the receding-horizon trajectory optimization. These advantages highlight the modular framework's potential applications in robotics and human-robot interaction; see the video at https://www.acin.tuwien.ac.at/en/6e64/.
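To make the multi-rate structure concrete, the following minimal sketch illustrates one way a fast pose stream (30 Hz) can feed a slower replanning loop (10 Hz) through a smoothing filter. It is not the authors' implementation: the one-dimensional constant-velocity Kalman filter, the noise values `q` and `r`, the class `ConstantVelocityKalman1D`, and the function `run_demo` are all assumptions chosen for illustration, and the "planner" is reduced to recording an extrapolated target position.

```python
from dataclasses import dataclass

POSE_RATE_HZ = 30   # pose localization rate quoted in the abstract
PLAN_EVERY = 3      # replan on every 3rd pose sample, i.e. at 10 Hz


@dataclass
class ConstantVelocityKalman1D:
    """Scalar constant-velocity Kalman filter (hypothetical helper)."""
    p: float = 0.0      # position estimate
    v: float = 0.0      # velocity estimate
    c_pp: float = 1.0   # covariance entries of [[c_pp, c_pv], [c_pv, c_vv]]
    c_pv: float = 0.0
    c_vv: float = 1.0
    q: float = 1e-2     # process noise intensity (assumed value)
    r: float = 1e-4     # measurement noise variance (assumed value)

    def predict(self, dt: float) -> None:
        # State propagation with F = [[1, dt], [0, 1]].
        self.p += dt * self.v
        # Covariance propagation P <- F P F^T + Q (white-noise acceleration).
        c_pp = self.c_pp + 2.0 * dt * self.c_pv + dt * dt * self.c_vv
        c_pv = self.c_pv + dt * self.c_vv
        self.c_pp = c_pp + self.q * dt ** 3 / 3.0
        self.c_pv = c_pv + self.q * dt ** 2 / 2.0
        self.c_vv = self.c_vv + self.q * dt

    def update(self, z: float) -> None:
        # Position-only measurement, H = [1, 0].
        s = self.c_pp + self.r
        k_p, k_v = self.c_pp / s, self.c_pv / s
        innovation = z - self.p
        self.p += k_p * innovation
        self.v += k_v * innovation
        # Covariance update P <- (I - K H) P.
        c_pp = (1.0 - k_p) * self.c_pp
        c_pv = (1.0 - k_p) * self.c_pv
        c_vv = self.c_vv - k_v * self.c_pv
        self.c_pp, self.c_pv, self.c_vv = c_pp, c_pv, c_vv

    def extrapolate(self, horizon: float) -> float:
        # Predicted position `horizon` seconds ahead of the current estimate.
        return self.p + horizon * self.v


def run_demo(duration_s: float = 2.0, speed: float = 0.1):
    """Track a synthetic object moving at `speed` m/s along one axis."""
    dt = 1.0 / POSE_RATE_HZ
    kf = ConstantVelocityKalman1D()
    targets = []  # (time, predicted target position) handed to the planner
    for k in range(1, round(duration_s * POSE_RATE_HZ) + 1):
        t = k * dt
        z = speed * t            # noise-free synthetic pose measurement
        kf.predict(dt)
        kf.update(z)
        if k % PLAN_EVERY == 0:  # slower replanning tick
            # Aim at where the object should be one planning period ahead.
            targets.append((t, kf.extrapolate(PLAN_EVERY * dt)))
    return kf, targets


if __name__ == "__main__":
    kf, targets = run_demo()
    print(f"estimated velocity: {kf.v:.3f} m/s, replans: {len(targets)}")
```

In this sketch the filter absorbs every pose sample, while the planner only consumes every third estimate and aims one planning period ahead, so the slower loop always works with a smoothed, forward-predicted target rather than the latest raw measurement.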
