RoboGolf: Mastering Real-World Minigolf with a Reflective Multi-Modality Vision-Language Model

Hantao Zhou, Tianying Ji, Lukas Sommerhalder, Michael Goerner, Norman Hendrich, Jianwei Zhang, Fuchun Sun, Huazhe Xu

TL;DR

RoboGolf tackles real-world minigolf by fusing dual-camera perception with nested closed-loop planning and a higher-level reflective equilibrium loop. A kinodynamically fine-tuned vision-language framework guides inner-loop hitting parameter estimation and route planning, while a counterfactual VLM assesses course feasibility and suggests proactive modifications. The approach is evaluated offline on a large, diverse dataset, demonstrating rapid convergence of action parameters and the ability to turn infeasible tasks into feasible ones through court redesign. This work highlights the potential of multi-modality VLMs to enable adaptive, real-world robotic decision-making with proactive task modification capabilities.

Abstract

Minigolf is an exemplary real-world game for examining embodied intelligence, requiring challenging spatial and kinodynamic understanding to putt the ball. Additionally, reflective reasoning is required if the feasibility of a challenge is not ensured. We introduce RoboGolf, a VLM-based framework that combines dual-camera perception with closed-loop action refinement, augmented by a reflective equilibrium loop. Both loops are powered by fine-tuned VLMs. We analyze the capabilities of the framework in an offline inference setting, relying on an extensive set of recorded trajectories. Exemplary demonstrations of the analyzed problem domain are available at https://jity16.github.io/RoboGolf/

Paper Structure (28 sections, 12 figures)

Figures (12)

  • Figure 1: Conceptual Overview. Our system integrates dual-camera scene perception, an inner action refinement loop that predicts hit parameters, and an outer reasoning loop that assesses course feasibility. An RGB-D camera captures the arranged spatial scene, and ball trajectories are tracked using an event camera. The inner loop derives hitting parameters, adjusting them through evaluation of failed attempts. The outer reflective loop uses counterfactual reasoning to suggest course modifications in unsolvable scenarios.
  • Figure 2: Perception module with dual-camera setup. $\bullet$ Spatial Information: an RGB-D camera captures details of the minigolf course. $\blacktriangle$ Topography Process: SAM processes the RGB images, which are combined with depth information to generate accurate topography and court key points. The pathway is reconstructed by matching the positions of the segmented objects. $\bullet$ Dynamic Tracking: the RGB-D camera and the event camera record the movement of the high-speed golf ball. $\blacktriangle$ Trajectory Process: RGB-D videos serve as prompts for SAM to reduce noise in the event-camera videos and to distinguish the trajectories of the golf club and the golf balls.
  • Figure 3: Hardware and setup. The setup features a UR5 robotic arm with a 3D-printed connector, a Robotiq gripper, and a golf club. Perception hardware includes an event camera and a depth camera. The court includes 11 distinct object types, allowing varied minigolf configurations.
  • Figure 4: Dataset rollouts on courts of increasing complexity. The dotted lines show the trajectories of successful hits, while the marks below indicate whether the correct action parameters were inferred in each inner-loop iteration. In each case, the system identifies the correct action parameters within a few trials.
  • Figure 5: Key frames of the successful demonstration in the billiard challenge. The objective is to hit the red ball to bump the white ball into the yellow disk target. For visual clarity, the two balls are enclosed in boxes.
  • ...and 7 more figures
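
The nested structure described in the Figure 1 caption (an inner loop that refines hit parameters from failed attempts, wrapped by an outer reflective loop that modifies the court when no solution is found) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names HitParams, Outcome, inner_loop, outer_loop, the speed/angle parameterization, the trial budgets, and the toy court dictionary are all hypothetical, and the callables propose, execute, and modify stand in for the fine-tuned VLMs and the recorded rollouts.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class HitParams:
    # Hypothetical parameterization; the source only says "hitting parameters".
    speed: float
    angle_deg: float


@dataclass
class Outcome:
    success: bool
    feedback: str  # e.g. a textual summary of the observed ball trajectory


History = List[Tuple[HitParams, Outcome]]


def inner_loop(court: Any,
               propose: Callable[[Any, History], HitParams],
               execute: Callable[[Any, HitParams], Outcome],
               max_trials: int = 5) -> Optional[HitParams]:
    """Refine hit parameters using the outcomes of failed attempts."""
    history: History = []
    for _ in range(max_trials):
        params = propose(court, history)   # stand-in for the inner-loop VLM
        outcome = execute(court, params)   # stand-in for a real or recorded hit
        if outcome.success:
            return params
        history.append((params, outcome))
    return None  # no successful parameters found within the trial budget


def outer_loop(court: Any,
               propose: Callable[[Any, History], HitParams],
               execute: Callable[[Any, HitParams], Outcome],
               modify: Callable[[Any], Any],
               max_redesigns: int = 3) -> Tuple[Any, Optional[HitParams]]:
    """If the inner loop fails, ask for a court modification and retry."""
    for _ in range(max_redesigns + 1):
        params = inner_loop(court, propose, execute)
        if params is not None:
            return court, params
        court = modify(court)              # stand-in for the reflective VLM
    return court, None


# Toy stand-ins (assumptions for illustration only).
def toy_propose(court, history):
    # Raise the speed a little after every failed attempt.
    return HitParams(speed=0.5 + 0.25 * len(history), angle_deg=0.0)


def toy_execute(court, params):
    ok = (not court["blocked"]) and params.speed >= 1.0
    return Outcome(success=ok, feedback="reached target" if ok else "missed")


def toy_modify(court):
    # Counterfactual-style fix: remove the obstacle that blocks every route.
    return {**court, "blocked": False}


final_court, params = outer_loop({"blocked": True},
                                 toy_propose, toy_execute, toy_modify)
```

In this toy run the initial court is unsolvable, so the outer loop applies one modification before the inner loop converges. Keeping the VLM calls behind plain callables makes it easy to swap in recorded trajectories for offline evaluation, which matches the offline inference setting the abstract describes.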