
Tell and show: Combining multiple modalities to communicate manipulation tasks to a robot

Petr Vanc, Radoslav Skoviera, Karla Stepanova

TL;DR

This work tackles robust human–robot interaction by fusing gestures and language with contextual scene information to infer manipulation intents. It introduces a merging algorithm augmented with diagonal cross-entropy–based belief weighting and feasibility penalties that account for action parameters and object properties. An adaptive entropy-based thresholding mechanism governs when to execute actions or query users, with extensive ablations showing improved resilience to noise and misalignment across simulated and real datasets. The approach demonstrates strong robustness and adaptability, offering practical benefits for natural, context-aware human–robot collaboration.

Abstract

As human-robot collaboration becomes more widespread, there is a need for a more natural way of communicating with the robot. This includes combining data from several modalities with the context of the situation and background knowledge. Current approaches to communication typically rely on a single modality, or are rigid and not robust to missing, misaligned, or noisy data. In this paper, we propose a novel method that takes inspiration from sensor fusion approaches to combine uncertain information from multiple modalities and enhance it with situational awareness (e.g., considering object properties or the scene setup). We first evaluate the proposed solution on simulated bimodal datasets (gestures and language) and show, through several ablation experiments, the importance of the individual components of the system and its robustness to noisy, missing, or misaligned observations. We then implement and evaluate the model on a real setup. In human-robot interaction, we must also consider whether the selected action is probable enough to be executed or whether we should instead ask the human for clarification. For this purpose, we enhance our model with adaptive entropy-based thresholding that finds appropriate thresholds for different types of interaction and performs comparably to fine-tuned fixed thresholds.
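
The execute-or-ask decision described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`merge_add`, `normalized_entropy`, `decide`), the three example actions, and the fixed `max_entropy` value are assumptions made for illustration. The fixed threshold stands in for the adaptive thresholding of the paper, and the penalization terms are omitted.

```python
import math


def normalize(scores):
    """Scale non-negative scores so that they sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")
    return [s / total for s in scores]


def merge_add(gesture, language):
    """Additive fusion (assumed form): average the per-action likelihoods."""
    return normalize([(g + l) / 2 for g, l in zip(gesture, language)])


def normalized_entropy(probs):
    """Shannon entropy divided by its maximum log(n); the result is in [0, 1]."""
    n = len(probs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(n)


def decide(gesture, language, actions, max_entropy=0.6):
    """Execute the most likely action if the merged belief is peaked enough,
    otherwise ask the user. max_entropy is a hypothetical fixed threshold."""
    merged = merge_add(gesture, language)
    best = max(range(len(actions)), key=merged.__getitem__)
    if normalized_entropy(merged) <= max_entropy:
        return ("execute", actions[best])
    return ("ask", None)


actions = ["pick", "put", "pour"]

# Both modalities agree: the merged belief is peaked, so the action runs.
print(decide([0.9, 0.05, 0.05], [0.8, 0.1, 0.1], actions))

# The modalities disagree weakly: the merged belief is near uniform, so ask.
print(decide([0.4, 0.3, 0.3], [0.3, 0.4, 0.3], actions))
```

Normalizing the entropy by log(n) keeps the threshold comparable across action sets of different sizes, which is one reason a single scalar threshold is usable at all in this sketch.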

Paper Structure (29 sections, 10 equations, 8 figures, 1 table)

This paper contains 29 sections, 10 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Human-Robot Interaction experimental setup. The user's speech is captured by the microphone and the hand is captured by a hand detection device (e.g., Leap Motion Controller; Weichert et al., 2013).
  • Figure 2: Diagram of the proposed model for the case of two modalities (hand gestures and natural language) specifying an action with one parameter (target object). The heard sentence "Unglue a cup" is correctly resolved into "Pick a cup" based on a fusion of data from both modalities and the task and scene context.
  • Figure 3: Real experimental setup. (left) Set of all objects used in the real experiment. (right) Example of the setup with 3 objects (box, can, and cleaner) and two storage areas (drawer, bowl) for the instruction "Put the can into the drawer". See the attached video and project website for more examples.
  • Figure 4: Different levels of noise added to simulated data.
  • Figure 5: Ablation study showing the performance of the proposed model ($M_3$) compared to models without individual penalization functions ($M_2$, $M_1$) and to the baseline. The baseline corresponds to merging the modalities with the $\arg\max$ function without any penalization terms. The results are shown for the aligned ($D_\mathcal{A}^{sim}$) and unaligned ($D_\mathcal{U}^{sim}$) simulated datasets as well as for the real datasets ($D_\mathcal{A}^{real}$, $D_\mathcal{U}^{real}$). Models $M_1$, $M_2$, and $M_3$ used the add merging function and entropy thresholding. Real noise $n^{real}_1$ was added to both simulated datasets.
  • ...and 3 more figures