GELATO: Multi-Instruction Trajectory Reshaping via Geometry-Aware Multiagent-based Orchestration

Junhui Huang, Yuhe Gong, Changsheng Li, Xingguang Duan, Luis Figueredo

TL;DR

The paper tackles open-vocabulary trajectory modification in human-robot interaction by introducing GELATO, a learning-free framework with four components: VLM-assisted geometry registration that produces a 6D primitive scene representation, an LLM-driven constraint generator, a geometry-aware vector-field optimizer, and a multi-agent observer-refinement loop that handles multi-instruction inputs without retraining. Its key contributions are explicit geometric grounding via analytic primitives, interpretable and verifiable constraints, and multi-agent orchestration with observer feedback. The authors report improved safety, smoothness, and alignment with user intent compared with point-based and learning-based baselines, validated through simulations, user studies, and real-robot trials. The work targets more intuitive and reliable human-robot interaction in dynamic environments by integrating geometric reasoning with natural-language grounding and multi-agent negotiation of conflicting objectives.

Abstract

We present GELATO -- the first language-driven trajectory reshaping framework to embed geometric environment awareness and multi-agent feedback orchestration to support multi-instruction inputs in human-robot interaction scenarios. Unlike prior learning-based methods, our approach automatically registers scene objects as 6D geometric primitives via a VLM-assisted multi-view pipeline, and an LLM translates multiple free-form instructions into explicit, verifiable geometric constraints. These are integrated into a geometry-aware vector field optimization to adapt initial trajectories while preserving smoothness, feasibility, and clearance. We further introduce a multi-agent orchestration with observer-based refinement to handle multi-instruction inputs and interactions among objectives -- increasing success rate without retraining. Simulation and real-world experiments demonstrate that our method achieves smoother, safer, and more interpretable trajectory modifications than state-of-the-art baselines.

Paper Structure

This paper contains 17 sections, 8 equations, 10 figures, 1 algorithm.

Figures (10)

  • Figure 1: Multi-instruction trajectory adaptation based on natural language commands and 6D environment information. GELATO is the first language-guided trajectory adaptation framework to support geometry-aware interactions (considering full object poses instead of keypoints) and proper safe modulation. The framework fuses object semantics with automatic geometric registration (planes, cylinders, cuboids) and safe modulation to satisfy multiple instructions. A multi-agent loop proposes alternative semantic-based strategies, and an observer verifies constraint satisfaction to refine the final path.
  • Figure 2: System Architecture. (Top) The environment-grounding module segments objects (Grounding-DINO, SAM), fuses multi-view RGB-D, and fits geometric primitives (6D OBB), producing a structured scene. (Bottom) An LLM translates language instructions into structured motion constraints. (Right) A multi-agent system integrates the outputs from both streams to reason over and interactively optimize an initial trajectory, producing a final path that fulfils the user's intent. The process iterates until all constraints are satisfied.
  • Figure 3: Representative results across three synthetic datasets, i.e., single-command, multi-command, and concatenated-instruction inputs. The initial trajectory (black) is modified by GELATO (red). The rightmost image shows the outputs of different agents, illustrating how strategy choice (parallel vs. sequential; priority/importance) affects the final reshaped trajectory.
  • Figure 4: Comparison between LATTE (blue) and GELATO (red) on the single-command and multiple-command datasets.
  • Figure 5: Points along the trajectory and their corresponding nearest points on the object. When the distance exceeds a certain threshold, no nearest point is assigned. Black indicates the trajectory, and red indicates the nearest points.
  • ...and 5 more figures
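
The nearest-point assignment described in the Figure 5 caption can be sketched for a single cuboid primitive. This is only an illustrative sketch, not the authors' implementation: it assumes the object is a solid oriented bounding box given by a center, a rotation matrix whose columns are the box axes, and half extents, and the function name `nearest_points_on_obb` and the parameter `max_dist` are hypothetical.

```python
import numpy as np

def nearest_points_on_obb(traj, center, rotation, half_extents, max_dist):
    """Closest point on a solid oriented box for each trajectory point.

    Points farther than max_dist get no nearest point (their row is NaN).
    All names here are illustrative assumptions, not taken from the paper.
    """
    traj = np.asarray(traj, dtype=float)
    center = np.asarray(center, dtype=float)
    rotation = np.asarray(rotation, dtype=float)
    half_extents = np.asarray(half_extents, dtype=float)

    # Express each point in the box frame (columns of `rotation` are box axes).
    local = (traj - center) @ rotation
    # Clamping to the half extents yields the closest point of a solid box;
    # points already inside the box map to themselves (distance 0).
    clamped = np.clip(local, -half_extents, half_extents)
    # Map the clamped points back to the world frame.
    nearest = clamped @ rotation.T + center

    dist = np.linalg.norm(traj - nearest, axis=1)
    nearest[dist > max_dist] = np.nan  # beyond the threshold: unassigned
    return nearest, dist

# Example: a box rotated 90 degrees about z, so its long axis lies along world y.
rot_z90 = np.array([[0.0, -1.0, 0.0],
                    [1.0,  0.0, 0.0],
                    [0.0,  0.0, 1.0]])
trajectory = np.array([[0.0, 3.0, 0.0],   # 1 unit beyond the box end
                       [0.0, 9.0, 0.0]])  # too far: no nearest point
nearest, dist = nearest_points_on_obb(
    trajectory, center=[0.0, 0.0, 0.0], rotation=rot_z90,
    half_extents=[2.0, 1.0, 1.0], max_dist=2.0)
```

Clamping in the box frame is the standard closed-form query for boxes; cylinders and planes would need their own analytic distance functions.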