Cross-modality Force and Language Embeddings for Natural Human-Robot Communication

Ravi Tejwani, Karl Velazquez, John Payne, Paolo Bonato, Harry Asada

TL;DR

This work tackles the challenge of integrating verbal and tactile cues for natural human-robot interaction by proposing a cross-modality embedding that aligns force profiles with natural language in a shared latent space $\mathcal{Z} \subset \mathbb{R}^{16}$. It introduces a dual autoencoder framework with an encoder and a decoder for each modality (force and language), trained under a weighted sum of reconstruction, contrastive, and translation losses, $\mathcal{L}=k_r\mathcal{L}_r+k_z\mathcal{L}_c+k_t\mathcal{L}_t$, to achieve bidirectional translation between force trajectories and phrases. The method is evaluated on data from 10 participants using both a phrase-to-force and a force-to-phrase protocol, comparing SBERT-based and binary phrase representations. The dual autoencoder outperforms baselines by about 20–30% across key metrics and generalizes robustly to unseen inputs. The results highlight a trade-off: SBERT embeddings improve force reconstruction and generalization, while binary phrase encodings yield sharper linguistic alignment and in-distribution performance, which informs deployment choices for real-world human-robot communication tasks. Overall, the framework lays a foundation for more intuitive, force-aware human-robot collaboration, with implications for rehabilitation therapy and other contact-rich manipulation scenarios.
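The weighted objective $\mathcal{L}=k_r\mathcal{L}_r+k_z\mathcal{L}_c+k_t\mathcal{L}_t$ can be sketched in a few lines of Python. This is only an illustration of how the three terms combine. The function names `mse` and `total_loss`, the default weights of 1.0, and the use of mean squared error as a stand-in for a reconstruction term are assumptions made here, not details taken from the paper.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences.

    Used here as a hypothetical stand-in for a reconstruction-style term;
    the paper's exact loss definitions are not given in this summary.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(
    l_r: float,
    l_c: float,
    l_t: float,
    k_r: float = 1.0,  # assumed default weight, not from the paper
    k_z: float = 1.0,  # assumed default weight, not from the paper
    k_t: float = 1.0,  # assumed default weight, not from the paper
) -> float:
    """Weighted sum L = k_r * L_r + k_z * L_c + k_t * L_t."""
    return k_r * l_r + k_z * l_c + k_t * l_t


if __name__ == "__main__":
    # Toy numbers only: a made-up force sample and its made-up reconstruction.
    force = [0.0, 0.5, 1.0]
    force_reconstructed = [0.0, 0.4, 1.1]
    l_r = mse(force, force_reconstructed)
    # The contrastive and translation terms are plain placeholders here.
    print(total_loss(l_r, l_c=0.2, l_t=0.1, k_r=1.0, k_z=2.0, k_t=3.0))
```

Raising one weight relative to the others shifts the training emphasis toward that term, for example toward tighter cross-modal alignment when `k_z` is large.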

Abstract

A method for cross-modality embedding of force profile and words is presented for synergistic coordination of verbal and haptic communication. When two people carry a large, heavy object together, they coordinate through verbal communication about the intended movements and physical forces applied to the object. This natural integration of verbal and physical cues enables effective coordination. Similarly, human-robot interaction could achieve this level of coordination by integrating verbal and haptic communication modalities. This paper presents a framework for embedding words and force profiles in a unified manner, so that the two communication modalities can be integrated and coordinated in a way that is effective and synergistic. Here, it will be shown that, although language and physical force profiles are deemed completely different, the two can be embedded in a unified latent space and proximity between the two can be quantified. In this latent space, a force profile and words can a) supplement each other, b) integrate the individual effects, and c) substitute in an exchangeable manner. First, the need for cross-modality embedding is addressed, and the basic architecture and key building block technologies are presented. Methods for data collection and implementation challenges will be addressed, followed by experimental results and discussions.

Paper Structure

This paper contains 40 sections, 9 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: A physical therapist from Spaulding Rehabilitation Hospital demonstrates 'hamstring curl' therapy on a patient with neurological injuries. She instructs the patient to move 'gently forward' while providing an assistive force.
  • Figure 2: Coordinate system mapping direction words to spatial axes for interpreting force profiles.
  • Figure 3: Conceptual illustration of the desired properties of the cross-modality latent space. A corresponding force profile and phrase should be located near each other, as measured by a metric such as cosine similarity, whereas force profiles and phrases that do not correspond should be positioned far apart.
  • Figure 4: Conceptual illustration of the process of matching an arbitrary text input with the most semantically similar phrase using SBERT embeddings.
  • Figure 5: Examples of corresponding force profiles and phrase pairs. (a,b) Basic motions in forward and backward directions, showing dominant positive and negative y-components respectively. (c,d) Effect of adding modifiers ('softly' and 'greatly') to forward motion, demonstrating how they alter force magnitude while maintaining direction.
  • ...and 6 more figures
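
The proximity idea in Figures 3 and 4 (score candidates by cosine similarity and pick the closest phrase) can be sketched as follows. This is a minimal illustration under stated assumptions: the three-dimensional vectors in `TOY_PHRASES` are made up and are not SBERT outputs, the 'left' and 'right' entries are hypothetical additions to the 'forward' and 'backward' examples named in Figure 5, and `cosine_similarity` and `nearest_phrase` are names chosen here, not functions from the paper.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 1.0 means same direction."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def nearest_phrase(
    query: Sequence[float], phrase_vectors: Dict[str, Sequence[float]]
) -> str:
    """Return the phrase whose vector is most similar to the query vector."""
    return max(
        phrase_vectors,
        key=lambda phrase: cosine_similarity(query, phrase_vectors[phrase]),
    )


# Toy vectors for illustration only (hypothetical, not real embeddings).
TOY_PHRASES: Dict[str, Sequence[float]] = {
    "forward": [0.0, 1.0, 0.0],
    "backward": [0.0, -1.0, 0.0],
    "left": [-1.0, 0.0, 0.0],
    "right": [1.0, 0.0, 0.0],
}

if __name__ == "__main__":
    # A query that points mostly along +y should match 'forward'.
    print(nearest_phrase([0.1, 0.9, 0.0], TOY_PHRASES))
```

Cosine similarity ignores vector magnitude, so in this toy setup a modifier such as 'softly' or 'greatly', which Figure 5 describes as changing force magnitude while keeping direction, would need a representation that also encodes scale.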