TransforMerger: Transformer-based Voice-Gesture Fusion for Robust Human-Robot Communication

Petr Vanc, Karla Stepanova

TL;DR

TransforMerger tackles robust human-robot communication by fusing uncertain voice and gesture inputs through a probabilistic, transformer-based reasoning pipeline. It merges modalities into a unified probabilistic sequence and grounds references in a scene representation, using soft embeddings and a parametrized system prompt to enable context-aware reasoning with instruction-following LLMs. The approach demonstrates improved robustness to noise, misalignment, and incomplete data across simulated and real-world tabletop tasks, often outperforming deterministic baselines and single-modality interpretations. By grounding actions in scene context and leveraging probabilistic modality representations, TransforMerger offers a scalable path toward more natural and reliable multimodal HRI. The work provides extensive datasets, code, and model comparisons, highlighting the practical impact of probabilistic merging and context-aware reasoning for complex human-robot collaboration scenarios.

Abstract

As human-robot collaboration advances, natural and flexible communication methods are essential for effective robot control. Traditional methods relying on a single modality or rigid rules struggle with noisy or misaligned data, as well as with object descriptions that do not exactly match the predefined object names (e.g., 'Pick that red object'). We introduce TransforMerger, a transformer-based reasoning model that infers a structured action command for robotic manipulation from fused voice and gesture inputs. Our approach merges multimodal data into a single unified sentence, which is then processed by the language model. We employ probabilistic embeddings to handle uncertainty, and we integrate contextual scene understanding to resolve ambiguous references (e.g., gestures pointing to multiple objects or vague verbal cues like "this"). We evaluate TransforMerger in simulated and real-world experiments, demonstrating its robustness to noise, misalignment, and missing information. Our results show that TransforMerger outperforms deterministic baselines, especially in scenarios requiring more contextual knowledge, enabling more robust and flexible human-robot communication. Code and datasets are available at: http://imitrob.ciirc.cvut.cz/publications/transformerger.
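
To make the idea of a "single unified sentence" with probabilistic embeddings more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each modality delivers timestamped observations with a probability distribution over candidate words, that merging simply interleaves them by time, and that a soft embedding is the probability-weighted average of the candidates' embedding vectors. The names `Observation`, `merge_modalities`, `soft_embedding`, and the toy embedding table are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Observation:
    """One uncertain observation from a modality (hypothetical structure)."""
    timestamp: float              # seconds since the start of the utterance
    candidates: Dict[str, float]  # candidate word -> unnormalized probability


def normalize(candidates: Dict[str, float]) -> Dict[str, float]:
    """Rescale candidate scores so that they sum to one."""
    total = sum(candidates.values())
    if total <= 0:
        raise ValueError("candidate probabilities must sum to a positive value")
    return {word: p / total for word, p in candidates.items()}


def merge_modalities(voice: List[Observation],
                     gesture: List[Observation]) -> List[Dict[str, float]]:
    """Interleave both modalities by timestamp into one probabilistic sequence.

    Assumption: time ordering alone decides the position of each observation.
    """
    merged = sorted(voice + gesture, key=lambda obs: obs.timestamp)
    return [normalize(obs.candidates) for obs in merged]


def soft_embedding(distribution: Dict[str, float],
                   embedding_table: Dict[str, List[float]]) -> List[float]:
    """Probability-weighted average of the candidate word embeddings."""
    dim = len(next(iter(embedding_table.values())))
    result = [0.0] * dim
    for word, p in distribution.items():
        vector = embedding_table[word]
        for i in range(dim):
            result[i] += p * vector[i]
    return result


# Voice says "pick this" (with some recognition uncertainty);
# the pointing gesture is ambiguous between two objects.
voice = [
    Observation(0.0, {"pick": 0.9, "kick": 0.1}),
    Observation(0.4, {"this": 1.0}),
]
gesture = [Observation(0.5, {"cup": 0.6, "cube": 0.4})]

sequence = merge_modalities(voice, gesture)

# Toy two-dimensional embeddings, purely illustrative.
toy_embeddings = {"cup": [1.0, 0.0], "cube": [0.0, 1.0]}
pointed_object = soft_embedding(sequence[2], toy_embeddings)
```

In this sketch the ambiguous gesture is not collapsed to its most likely object; its full distribution is carried forward as a blended vector, which is the property that lets a downstream language model weigh the alternatives against the scene context.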

Paper Structure

This paper contains 32 sections, 10 equations, 7 figures.

Figures (7)

  • Figure 1: Human-Robot Interaction Pipeline. A user communicates tasks through hand gestures ($\mathcal{S}_G$), captured via a hand sensor, and voice commands ($\mathcal{S}_V$), recorded by a microphone. A camera monitors the scene to detect the scene objects ($\mathcal{O}$). Our solution uses a state-of-the-art transformer-based large language model (1-3B parameters, running offline) to reason about the user's intent and generate clear action commands for the robot to execute.
  • Figure 2: System architecture for the real-world experiments, detailing the semantic reasoner from Fig. 1. The system merges multimodal inputs into a single Skill Command, a high-level instruction for the robot to execute. The blue blocks highlight the paper's contributions. In the simulated setup, $\mathcal{S}_G$ and $\mathcal{S}_V$ are generated from a created dataset (see Sec. \ref{sec:artificial_dataset}).
  • Figure 3: Example simulated inputs from gesture and language and the result of their merging (see Sec. \ref{sec:merging_algorithm}).
  • Figure 4: System prompt: Parameterized scene-aware prompt for the reasoning model, incorporating structured reasoning steps, the model's role, and the required output. Available actions, objects, and scene descriptions are dynamically inserted as parameters for each specific task (example in Fig. 5).
  • Figure 5: Example task parameters (objects, actions, and scene description) inserted into the system prompt (Fig. 4) for one of the experiments.
  • ...and 2 more figures
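
The parameterized system prompt described for Figures 4 and 5 can be illustrated with a short template sketch. This is an assumption-laden example, not the prompt used in the paper: the wording of `PROMPT_TEMPLATE`, the helper `build_system_prompt`, and the sample actions, objects, and scene text are all hypothetical. It only shows how task-specific parameters might be inserted into a fixed prompt skeleton.

```python
from string import Template
from typing import List

# Hypothetical prompt skeleton: the role, the inserted parameters,
# and the required output format. The wording is illustrative only.
PROMPT_TEMPLATE = Template(
    "You are an assistant that turns a user's request into one robot command.\n"
    "Available actions: $actions\n"
    "Available objects: $objects\n"
    "Scene description: $scene\n"
    "Reply with: action: <action>, object: <object>"
)


def build_system_prompt(actions: List[str], objects: List[str], scene: str) -> str:
    """Fill the prompt skeleton with the parameters of one specific task."""
    return PROMPT_TEMPLATE.substitute(
        actions=", ".join(actions),
        objects=", ".join(objects),
        scene=scene,
    )


prompt = build_system_prompt(
    actions=["pick", "push"],
    objects=["cup", "cube"],
    scene="A red cup stands to the left of a green cube.",
)
print(prompt)
```

Keeping the skeleton fixed and changing only the parameters means the same reasoning instructions can be reused across tasks while the model still sees the actions and objects that are valid in the current scene.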