
Bridging the Communication Gap: Artificial Agents Learning Sign Language through Imitation

Federico Tavella, Aphrodite Galata, Angelo Cangelosi

TL;DR

This work tackles sign language acquisition for artificial agents by learning from visual demonstrations without extra hardware. It builds a full-body URDF humanoid, uses FrankMocap to extract monocular 3D body and hand pose from RGB video, and trains a PPO-based policy to imitate observed signs with a multiplicative reward $r_t = r_t^p \cdot r_t^v \cdot r_t^e \cdot r_t^r$, where each factor has the form $r_t^x = e^{-k^x \varepsilon_t^x}$. The approach learns five American Sign Language signs involving the upper body (arms and hands) in simulation, supporting the viability of vision-based imitation for embodied sign language. The results pave the way for sign-language-capable agents and outline concrete future directions toward higher hand dexterity and real-world deployment.
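The multiplicative reward above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `imitation_reward`, the error values, and the gains $k^x$ below are all hypothetical, and the keys `p`, `v`, `e`, `r` simply mirror the superscripts in the formula.

```python
import math


def imitation_reward(errors, gains):
    """Return the product over x of exp(-k^x * eps^x).

    errors: dict mapping a term name to its error eps_t^x at time t
    gains:  dict mapping the same term names to their scale k^x
    """
    reward = 1.0
    for name, eps in errors.items():
        # Each factor equals 1 for zero error and decays as the error grows.
        reward *= math.exp(-gains[name] * eps)
    return reward


# Hypothetical values, chosen only to exercise the formula.
errors = {"p": 0.10, "v": 0.50, "e": 0.05, "r": 0.02}
gains = {"p": 2.0, "v": 0.1, "e": 10.0, "r": 5.0}

r_t = imitation_reward(errors, gains)
print(f"r_t = {r_t:.4f}")
```

For non-negative errors and positive gains, every factor lies in $(0, 1]$, so the product does too. Because the terms are multiplied rather than summed, a large error in any single term pulls the whole reward toward zero.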

Abstract

Artificial agents, particularly humanoid robots, interact with their environment, objects, and people using cameras, actuators, and physical presence. Their communication methods are often pre-programmed, limiting their actions and interactions. Our research explores acquiring non-verbal communication skills through learning from demonstrations, with potential applications in sign language comprehension and expression. In particular, we focus on imitation learning for artificial agents, exemplified by teaching a simulated humanoid American Sign Language. We use computer vision and deep learning to extract information from videos, and reinforcement learning to enable the agent to replicate observed actions. Compared to other methods, our approach eliminates the need for additional hardware to acquire information. We demonstrate how the combination of these different techniques offers a viable way to learn sign language. Our methodology successfully teaches 5 different signs involving the upper body (i.e., arms and hands). This research paves the way for advanced communication skills in artificial agents.

Paper Structure (12 sections, 1 equation, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Overview of our proposal. Given an RGB video of a person performing a sign, we use a deep learning approach to extract body information, then use that information to teach a simulated humanoid how to perform the sign (i.e., above).
  • Figure 2: Whole body model. We integrated an available body model (Peng et al., 2018) with a hand model (Tavella et al., 2023), replicated and mirrored to obtain both left and right hands.
  • Figure 3: Body and hand keypoints extracted using FrankMocap. The body is composed of 24 keypoints, while each hand has 21 keypoints. (Both reprinted from frankimg)
  • Figure 4: Elbow dynamics with $k_d = 40$ (top) vs $8$ (bottom).
  • Figure 5: First run of 5 signs. The average cumulative reward is calculated over 10 different seeds.
  • ...and 1 more figure