Labeling Sentences with Symbolic and Deictic Gestures via Semantic Similarity

Ariel Gjaci, Carmine Tommaso Recchiuto, Antonio Sgorbissa

TL;DR

This paper addresses the challenge of grounding Symbolic and Deictic co-speech gestures in the semantic content of utterances. It introduces three labeling algorithms that assign gesture labels to word sequences: two rule-based algorithms, Fixed Window and Moving Window, which rely on the semantic similarity between sub-sentences and gesture reference sentences computed with a RoBERTa Cross-Encoder, and a Baseline that assigns labels from a predefined label distribution without computing similarity. The study grounds 12 gestures recognized in Italian culture with expert-created reference sentences and evaluates the labels against human annotations using Average Precision (AP), Intersection Over Union (IOU), and Average Computational Time (ACT). It finds that semantic similarity supports more semantically aligned gesture labeling than the purely statistical baseline. The proposed approach is data-light, scalable via offline precomputation, and designed for hybrid integration with data-driven gesture generation methods across cultures and contexts.

Abstract

Co-speech gesture generation on artificial agents has gained attention recently, particularly when it is based on data-driven models. However, end-to-end methods often fail to generate co-speech gestures that are related to semantics and have specific forms, i.e., Symbolic and Deictic gestures. In this work, we identify which words in a sentence are contextually related to Symbolic and Deictic gestures. First, we selected 12 gestures that are recognized by people from the Italian culture and that different humanoid robots can reproduce. Then, we implemented two rule-based algorithms to label sentences with Symbolic and Deictic gestures. The rules depend on the semantic similarity scores, computed with the RoBERTa model, between sentences that heuristically represent gestures and sub-sentences inside an objective sentence that artificial agents have to pronounce. We also implemented a baseline algorithm that assigns gestures without computing similarity scores. Finally, to validate the results, we asked 30 people to label a set of sentences with Deictic and Symbolic gestures through a Graphical User Interface (GUI), and we compared their labels with the ones produced by our algorithms. To this end, we computed Average Precision (AP) and Intersection Over Union (IOU) scores, and we evaluated the Average Computational Time (ACT). Our results show that semantic similarity scores are useful for finding Symbolic and Deictic gestures in utterances.
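
To make the comparison between algorithm labels and human labels concrete, the following is a minimal sketch of an Intersection Over Union score for a single gesture in a single sentence. It assumes that a label is the set of word positions covered by the gesture. The function name `span_iou`, the example indices, and the convention for two empty sets are illustrative assumptions, not the definition used in the paper.

```python
def span_iou(predicted: set[int], annotated: set[int]) -> float:
    """Intersection over union of two sets of word indices.

    Assumption: each label is represented by the positions of the words
    it covers; the paper's exact IOU definition may differ.
    """
    union = predicted | annotated
    if not union:
        # Convention chosen for this sketch: two empty labelings agree.
        return 1.0
    return len(predicted & annotated) / len(union)


# Hypothetical example: word positions in "Hey I'm so sorry".
predicted = {1, 2, 3}   # algorithm labels "I'm so sorry"
annotated = {2, 3}      # annotator labels "so sorry"
iou = span_iou(predicted, annotated)
print(round(iou, 3))    # 0.667
```

The two labelings share two word positions out of three covered in total, which gives 2/3.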

Paper Structure (14 sections, 7 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Tiago (on the left side) and Alter-ego (on the right side) robots.
  • Figure 2: Diagram representing the approach. Given a set of $K$ gestures $g_i$, each represented by a set of reference sentences $S_{i}$, and an objective sentence $S_{obj}$, a labeling algorithm is used to produce labels for $S_{obj}$. In the given sentence, three labels are produced for three different sequences of words, forming $W_p = \langle\langle\textit{Hey}\rangle_{\textit{Greet}}, \langle\textit{I'm, so, sorry}\rangle_{\textit{I apologize}}, \langle \textit{Can, you, forgive, me}\rangle_{\textit{I beg you}}\rangle$.
  • Figure 3: Practical example of the Baseline algorithm. Labels are assigned according to a predefined label distribution rather than similarity scores.
  • Figure 4: Practical example of the Fixed Window algorithm. Window sizes have a fixed value for each gesture. All similarities are computed starting from the words indicated by the arrows. The non-labeled words have a similarity score below $th_0$ for all the gestures, so they are skipped.
  • Figure 5: Practical example of the Moving Window algorithm. The labeling process is similar to that of the Fixed Window algorithm, but here window sizes are not fixed and can take any value between $1$ and $w_{max}$.
  • ...and 2 more figures
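
The window-based labeling described in the captions of Figures 4 and 5 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `token_overlap` is a stand-in for the RoBERTa similarity score so that the example runs without a model, the reference sentences are invented, and the greedy choice of the highest-scoring window at each position is an assumption. Only the threshold $th_0$ (`th0`) and the maximum window size $w_{max}$ (`w_max`) come from the captions.

```python
from typing import Callable


def token_overlap(a: str, b: str) -> float:
    """Stand-in similarity (Jaccard overlap of lowercased tokens).

    Assumption: used only to keep the sketch runnable; the paper computes
    similarity with a RoBERTa model instead.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def label_moving_window(
    words: list[str],
    references: dict[str, list[str]],
    similarity: Callable[[str, str], float],
    th0: float,
    w_max: int,
) -> list[tuple[int, int, str]]:
    """Label word windows left to right; returns (start, end, gesture)."""
    labels = []
    i = 0
    while i < len(words):
        best = None  # (score, window size, gesture)
        # Try every window size from 1 to w_max starting at word i.
        for w in range(1, min(w_max, len(words) - i) + 1):
            chunk = " ".join(words[i:i + w])
            for gesture, refs in references.items():
                score = max(similarity(chunk, r) for r in refs)
                if score >= th0 and (best is None or score > best[0]):
                    best = (score, w, gesture)
        if best is None:
            i += 1  # no gesture reaches th0 here: skip the word
        else:
            _, w, gesture = best
            labels.append((i, i + w, gesture))
            i += w  # continue after the labeled window
    return labels


# Hypothetical reference sentences for two gestures.
references = {
    "Greet": ["hey", "hello there"],
    "I apologize": ["i'm so sorry", "sorry about that"],
}
words = "Hey I'm so sorry".split()
labels = label_moving_window(words, references, token_overlap, th0=0.6, w_max=3)
print(labels)  # [(0, 1, 'Greet'), (1, 4, 'I apologize')]
```

A Fixed Window variant would replace the loop over `w` with a single predefined window size per gesture, as the Figure 4 caption describes.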