RoboGolf: Mastering Real-World Minigolf with a Reflective Multi-Modality Vision-Language Model
Hantao Zhou, Tianying Ji, Lukas Sommerhalder, Michael Goerner, Norman Hendrich, Jianwei Zhang, Fuchun Sun, Huazhe Xu
TL;DR
RoboGolf tackles real-world minigolf by fusing dual-camera perception with nested closed-loop planning and a higher-level reflective equilibrium loop. A vision-language framework fine-tuned for kinodynamic reasoning guides inner-loop hitting parameter estimation and route planning, while a counterfactual VLM assesses course feasibility and suggests proactive modifications. The approach is evaluated offline on a large, diverse dataset, demonstrating rapid convergence of action parameters and the ability to turn infeasible tasks into feasible ones through course redesign. This work highlights the potential of multi-modality VLMs to enable adaptive, real-world robotic decision-making with proactive task modification capabilities.
Abstract
Minigolf is an exemplary real-world game for examining embodied intelligence, since putting the ball requires sophisticated spatial and kinodynamic understanding. Reflective reasoning is additionally required when the feasibility of a challenge is not ensured. We introduce RoboGolf, a VLM-based framework that combines dual-camera perception with closed-loop action refinement, augmented by a reflective equilibrium loop. Fine-tuned VLMs form the core of both loops. We analyze the framework's capabilities in an offline inference setting, relying on an extensive set of recorded trajectories. Exemplary demonstrations of the analyzed problem domain are available at https://jity16.github.io/RoboGolf/
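
The interplay of the two loops described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Course` class, the scalar `required_speed` and `max_speed` fields, the proportional update, and all function names are hypothetical stand-ins chosen for illustration. In particular, the roles played by fine-tuned VLMs in RoboGolf (proposing corrected hitting parameters, judging feasibility, and suggesting modifications) are replaced here by simple arithmetic rules.

```python
from dataclasses import dataclass


@dataclass
class Course:
    # Hypothetical abstraction: a course reduced to the putt speed it needs.
    required_speed: float
    # Hypothetical actuation limit of the robot.
    max_speed: float = 2.0


def simulate_putt(course: Course, speed: float) -> float:
    """Stand-in for observing a putt: signed error (positive = too fast)."""
    return speed - course.required_speed


def inner_loop(course, speed=1.0, gain=0.5, tol=0.05, max_attempts=20):
    """Closed-loop refinement of one hitting parameter from observed outcomes.

    Returns (success, final_speed, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        error = simulate_putt(course, speed)
        if abs(error) <= tol:
            return True, speed, attempt
        # Placeholder for a model-proposed correction: a proportional
        # update, clipped to the assumed actuation limit.
        speed = min(course.max_speed, speed - gain * error)
    return False, speed, max_attempts


def reflect_and_modify(course: Course) -> Course:
    """Stand-in for the reflective loop: redesign a course judged infeasible."""
    if course.required_speed > course.max_speed:
        # Assumed modification: make the course easier, within the limit.
        return Course(course.max_speed * 0.9, course.max_speed)
    return course


def play(course: Course, max_redesigns: int = 3):
    """Outer loop: retry the inner loop, redesigning the course on failure.

    Returns (success, number_of_redesigns_applied).
    """
    for redesigns in range(max_redesigns + 1):
        success, _, _ = inner_loop(course)
        if success:
            return True, redesigns
        course = reflect_and_modify(course)
    return False, max_redesigns
```

Under these assumptions, `play(Course(required_speed=1.5))` succeeds without any redesign, whereas `play(Course(required_speed=3.0))` first exhausts the inner loop at the speed limit and succeeds only after one modification of the course.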
