Bridging the Communication Gap: Artificial Agents Learning Sign Language through Imitation
Federico Tavella, Aphrodite Galata, Angelo Cangelosi
TL;DR
This work tackles sign language acquisition for artificial agents by learning from visual demonstrations without extra hardware. It builds a full-body URDF humanoid, uses FrankMocap to extract monocular 3D body and hand pose from RGB video, and trains a PPO-based policy to imitate observed signs with a multiplicative reward $r_t = r_t^p \cdot r_t^v \cdot r_t^e \cdot r_t^r$, where each factor has the form $r_t^x = e^{-k^x \varepsilon_t^x}$ for an error term $\varepsilon_t^x$ and scale $k^x$. The agent learns five American Sign Language signs involving the upper body (arms and hands) in simulation, which supports the viability of vision-based imitation for embodied sign language. The results are a step toward sign-language-capable agents, and the work outlines future directions toward higher hand dexterity and real-world deployment.
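To make the reward structure concrete, the following is a minimal sketch of how a multiplicative reward of this form could be computed. It is not the authors' implementation. The function name `imitation_reward`, the dictionary layout, and the scale values in `SCALES` are all hypothetical, and the summary does not say what the superscripts $p$, $v$, $e$, $r$ denote, so they are treated here as opaque labels for four non-negative error terms.

```python
import math


def imitation_reward(errors, scales):
    """Return r_t = prod_x exp(-k^x * eps_t^x) over all error terms x.

    errors: mapping from term label to non-negative error eps_t^x
    scales: mapping from term label to positive scale k^x
    """
    reward = 1.0
    for label, err in errors.items():
        # Each factor lies in (0, 1]; it equals 1 only when the error is zero.
        reward *= math.exp(-scales[label] * err)
    return reward


# Hypothetical scales k^x; the source does not report the values used.
SCALES = {"p": 2.0, "v": 0.1, "e": 40.0, "r": 10.0}

# Zero error on every term gives the maximum reward.
perfect = imitation_reward({"p": 0.0, "v": 0.0, "e": 0.0, "r": 0.0}, SCALES)

# A large error on a single term pulls the whole product toward zero,
# even though the other three terms are matched exactly.
one_bad = imitation_reward({"p": 0.0, "v": 0.0, "e": 0.5, "r": 0.0}, SCALES)

print(perfect, one_bad)
```

Because the factors are multiplied rather than summed, the policy cannot offset a poor match on one term with a good match on another: every term has to be small for the reward to approach 1.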
Abstract
Artificial agents, particularly humanoid robots, interact with their environment, objects, and people using cameras, actuators, and physical presence. Their communication methods are often pre-programmed, limiting their actions and interactions. Our research explores acquiring non-verbal communication skills through learning from demonstrations, with potential applications in sign language comprehension and expression. In particular, we focus on imitation learning for artificial agents, exemplified by teaching American Sign Language to a simulated humanoid. We use computer vision and deep learning to extract information from videos, and reinforcement learning to enable the agent to replicate the observed actions. Unlike other methods, our approach requires no additional hardware to acquire the demonstrations. We show that combining these techniques offers a viable way to learn sign language. Our methodology successfully teaches five different signs involving the upper body (i.e., arms and hands). This research paves the way for advanced communication skills in artificial agents.
