Words2Contact: Identifying Support Contacts from Verbal Instructions Using Foundation Models

Dionis Totsila, Quentin Rouxel, Jean-Baptiste Mouret, Serena Ivaldi

TL;DR

Words2Contact identifies safe and usable support contacts for humanoid robots from natural-language instructions. It is a modular, language-guided pipeline with Prediction, Correction, and Control modules, routed by a Module Selector and powered by a mix of LLMs and VLMs, with a final SEIKO-based multi-contact controller. In the benchmark, smaller models combined with task decomposition perform close to larger models, and real-world experiments on the Talos humanoid, including iterative corrections, support its practical viability. The work points toward language-enabled teleoperation and shared autonomy, and outlines future work on online corrections, trajectory generation, and safety-aware constraints.

Abstract

This paper presents Words2Contact, a language-guided multi-contact placement pipeline leveraging large language models and vision language models. Our method is a key component for language-assisted teleoperation and human-robot cooperation, where human operators can use natural language to instruct robots where to place their support contacts before whole-body reaching or manipulation. Words2Contact transforms the verbal instructions of a human operator into contact placement predictions; it also handles iterative corrections until the human is satisfied with the contact location identified in the robot's field of view. We benchmark state-of-the-art LLMs and VLMs for size and performance in contact prediction. We demonstrate the effectiveness of the iterative correction process, showing that even naive users quickly learn how to instruct the system to obtain accurate locations. Finally, we validate Words2Contact in real-world experiments with the Talos humanoid robot, instructed by human operators to place support contacts on different locations and surfaces to avoid falling when reaching for distant objects.

Paper Structure (15 sections, 9 figures, 1 table)

Figures (9)

  • Figure 1: Using our Words2Contact pipeline, Talos executes the user's verbal instructions to (1) lean on a book and (2) reach for a cup that is out of reach yet within its field of view.
  • Figure 2: Words2Contact overview: (1) The user provides the first instruction. The Module Selector (Sec. \ref{subsub:module_selector}) classifies it as "Prediction". The Prediction Module (Sec. \ref{subsub:prediction_module}) integrates the user input and the robot's RGB data to predict a new contact point. (2) The user wants to adjust the predicted contact; their instruction is classified as "Correction". The Correction Module (Sec. \ref{subsub:correction_module}) adjusts the previous prediction based on the new user input and the RGB data. (3) The user confirms the corrected contact location: the instruction is classified as "Confirmation". The Control Module (Sec. \ref{subsub:control_module}) uses the point cloud, the initial user prompt, and the desired contact location to compute the desired 3D contact task, which is then executed by the SEIKO Multi-Contact whole-body controller.
  • Figure 3: Examples of system prompts used by the LLM in our modules. "+5 examples" refers to the 5 examples added to the system prompt as part of the few-shot prompting technique. All the system prompts are available at: https://hucebot.github.io/words2contact_website/.
  • Figure 4: The Prediction Module (Sec. \ref{subsub:prediction_module}): the Prompt-Analyzer is an LLM that analyzes the user's prompt and returns a JSON file with the chain of thought, list of objects, and position type (absolute or relative). For absolute positions, a point is extracted via language-grounded segmentation. For relative positions, a language-grounded object detection VLM detects bounding box(es), used by the LLM to predict the contact point.
  • Figure 5: In the Correction Module (Sec. \ref{subsub:correction_module}), the LLM detects object descriptions in the user prompt and the VLM identifies their bounding boxes; the LLM then uses these boxes, along with the current target position, the interaction history, and the user's instruction, to determine a new candidate contact $[i,j]$.
  • ...and 4 more figures
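
The routing described in the captions of Figures 2 and 4 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the JSON key names (`chain_of_thought`, `objects`, `position_type`), the lowercase labels, and all function names and callables (`segment_point`, `detect_box`, `point_from_boxes`, `predict`, `correct`) are hypothetical stand-ins for the LLM and VLM calls.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[int, int]            # pixel coordinates of a candidate contact
BBox = Tuple[int, int, int, int]   # assumed layout: (x_min, y_min, x_max, y_max)


@dataclass
class PromptAnalysis:
    """Parsed Prompt-Analyzer output (field names are assumptions)."""
    chain_of_thought: str
    objects: List[str]
    position_type: str  # "absolute" or "relative"


def parse_prompt_analysis(raw_json: str) -> PromptAnalysis:
    """Parse and validate the JSON returned by the Prompt-Analyzer LLM."""
    data = json.loads(raw_json)
    position_type = data["position_type"]
    if position_type not in ("absolute", "relative"):
        raise ValueError(f"unknown position type: {position_type!r}")
    return PromptAnalysis(data["chain_of_thought"], list(data["objects"]), position_type)


def predict_contact(
    analysis: PromptAnalysis,
    segment_point: Callable[[str], Point],            # stand-in for grounded segmentation
    detect_box: Callable[[str], BBox],                # stand-in for the detection VLM
    point_from_boxes: Callable[[List[BBox]], Point],  # stand-in for the LLM step
) -> Point:
    """Route to segmentation (absolute) or detection plus LLM (relative)."""
    if analysis.position_type == "absolute":
        return segment_point(analysis.objects[0])
    boxes = [detect_box(name) for name in analysis.objects]
    return point_from_boxes(boxes)


def step(
    label: str,
    instruction: str,
    current: Optional[Point],
    predict: Callable[[str], Point],
    correct: Callable[[str, Point], Point],
) -> Tuple[Optional[Point], bool]:
    """Apply one Module Selector decision; returns (contact, confirmed)."""
    label = label.lower()
    if label == "prediction":
        return predict(instruction), False
    if label == "correction":
        if current is None:
            raise ValueError("cannot correct before a first prediction")
        return correct(instruction, current), False
    if label == "confirmation":
        return current, True  # the contact would now be handed to the controller
    raise ValueError(f"unknown module label: {label!r}")
```

In this sketch, an operator session is a sequence of `step` calls that ends when `confirmed` is true, mirroring the predict, correct, confirm cycle of Figure 2.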