Labeling Sentences with Symbolic and Deictic Gestures via Semantic Similarity
Ariel Gjaci, Carmine Tommaso Recchiuto, Antonio Sgorbissa
TL;DR
This paper addresses the challenge of grounding Symbolic and Deictic co-speech gestures in the semantic content of utterances. It introduces three labeling algorithms: a Baseline that assigns gestures without similarity scores, and two rule-based algorithms, Fixed Window and Moving Window, that assign gesture labels to word sequences using their semantic similarity to gesture reference sentences, computed with a RoBERTa Cross-Encoder. The study grounds 12 gestures from Italian culture with expert-created reference sentences and evaluates the resulting labels against human annotations using AP, IOU, and ACT. It finds that semantic similarity yields gesture labels that are better aligned with sentence meaning than those of a purely statistical baseline. The proposed approach is data-light, scalable via offline precomputation, and designed for hybrid integration with data-driven gesture generation methods across cultures and contexts.
Abstract
Co-speech gesture generation for artificial agents has gained attention recently, particularly approaches based on data-driven models. However, end-to-end methods often fail to generate co-speech gestures that are tied to semantics and have specific forms, i.e., Symbolic and Deictic gestures. In this work, we identify which words in a sentence are contextually related to Symbolic and Deictic gestures. First, we selected 12 gestures that are recognized by people from Italian culture and that different humanoid robots can reproduce. Then, we implemented two rule-based algorithms to label sentences with Symbolic and Deictic gestures. The rules depend on semantic similarity scores, computed with the RoBERTa model, between sentences that heuristically represent gestures and sub-sentences of a target sentence that an artificial agent has to pronounce. We also implemented a baseline algorithm that assigns gestures without computing similarity scores. Finally, to validate the results, we asked 30 people to label a set of sentences with Deictic and Symbolic gestures through a Graphical User Interface (GUI), and we compared their labels with the ones produced by our algorithms. To this end, we computed Average Precision (AP) and Intersection Over Union (IOU) scores, and we evaluated the Average Computational Time (ACT). Our results show that semantic similarity scores are useful for finding Symbolic and Deictic gestures in utterances.
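To make the idea of similarity-based labeling concrete, the following is a minimal sketch of one plausible fixed-window rule, together with an IOU check against a human annotation. It is not the authors' implementation: the window size, the threshold, the function names (`label_fixed_window`, `toy_similarity`, `iou`), and the English reference sentences are all assumptions made for illustration. In particular, `toy_similarity` is a simple token-overlap score standing in for the RoBERTa Cross-Encoder, so that the example runs without downloading a model; any scorer with the same `(str, str) -> float` shape could be passed in instead.

```python
from typing import Callable, Dict, List, Optional, Set

Scorer = Callable[[str, str], float]


def toy_similarity(a: str, b: str) -> float:
    """Stand-in scorer (assumption): Jaccard overlap of lowercase tokens.

    This is NOT RoBERTa; it only mimics the interface of a sentence-pair scorer.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def label_fixed_window(
    sentence: str,
    references: Dict[str, str],
    scorer: Scorer = toy_similarity,
    window: int = 3,          # hypothetical window size
    threshold: float = 0.3,   # hypothetical acceptance threshold
) -> List[Optional[str]]:
    """Assign at most one gesture label to each non-overlapping word window."""
    words = sentence.split()
    labels: List[Optional[str]] = [None] * len(words)
    for start in range(0, len(words), window):
        end = min(start + window, len(words))
        chunk = " ".join(words[start:end])
        best_gesture, best_score = None, float("-inf")
        # Compare the sub-sentence with every gesture reference sentence.
        for gesture, ref in references.items():
            score = scorer(chunk, ref)
            if score > best_score:
                best_gesture, best_score = gesture, score
        # Label the window only if the best match is similar enough.
        if best_gesture is not None and best_score >= threshold:
            for i in range(start, end):
                labels[i] = best_gesture
    return labels


def iou(pred: Set[int], truth: Set[int]) -> float:
    """Intersection Over Union between two sets of word indices."""
    union = pred | truth
    return len(pred & truth) / len(union) if union else 1.0


# Hypothetical reference sentences for two gestures.
references = {"come_here": "come here to me", "stop": "stop right there"}
labels = label_fixed_window("come here please and stop right there", references)

# Word indices the algorithm labeled "stop" vs. a made-up human annotation.
pred_stop = {i for i, g in enumerate(labels) if g == "stop"}
human_stop = {4, 5, 6}
stop_iou = iou(pred_stop, human_stop)
```

In this toy run, the window boundaries decide which words receive a label, so a word such as "and" can be swept into a gesture span simply because it shares a window with matching words. A rule that slides the window over the sentence, rather than cutting it at fixed positions, is one way to reduce that effect, although the exact rules used in the paper are not specified in this section.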
