Signs of Language: Embodied Sign Language Fingerspelling Acquisition from Demonstrations for Human-Robot Interaction

Federico Tavella, Aphrodite Galata, Angelo Cangelosi

TL;DR

This paper presents SiLa, an embodied imitation-learning framework for robotic fingerspelling that learns from RGB video demonstrations without extra hardware. It builds a 15-DoF URDF hand, uses FrankMocap to obtain 3D hand poses from RGB, and trains PPO and SAC policies to imitate reference motions in a PyBullet simulation with a PD controller tuned via Bayesian optimization. Across six fingerspelled letters, SiLa achieves imitation performance comparable to motion retargeting, while providing an embodied neural representation of signs and demonstrating generalization capabilities. The work highlights practical challenges and outlines future directions, including expanding DoFs, covering the full alphabet, and leveraging motion priors and mixture-of-experts for efficiency and scalability.

Abstract

Learning fine-grained movements is a challenging topic in robotics, particularly in the context of robotic hands. One specific instance of this challenge is the acquisition of fingerspelling sign language in robots. In this paper, we propose an approach for learning dexterous motor imitation from video examples without additional information. To achieve this, we first build a URDF model of a robotic hand with a single actuator for each joint. We then leverage pre-trained deep vision models to extract the 3D pose of the hand from RGB videos. Next, using state-of-the-art reinforcement learning algorithms for motion imitation (namely, proximal policy optimization and soft actor-critic), we train a policy to reproduce the movement extracted from the demonstrations. We identify the optimal set of hyperparameters for imitation based on a reference motion. Finally, we demonstrate the generalizability of our approach by testing it on six different tasks, corresponding to fingerspelled letters. Our results show that our approach is able to successfully imitate these fine-grained movements without additional information, highlighting its potential for real-world applications in robotics.

Paper Structure (17 sections, 8 equations, 5 figures, 2 tables)

This paper contains 17 sections, 8 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: SiLa extracts 3D coordinates and rotations from RGB videos using deep models. It then trains a policy with reinforcement learning to teach our robotic hand to imitate the reference motion.
  • Figure 2: URDF model of our robotic hand.
  • Figure 3: Error for $k_p$ and $k_d$ with maximum values equal to 1.
  • Figure 4: Comparison between the reference and simulated position (top) and velocity (bottom) for the last phalanx of the index finger using the best pair of parameters.
  • Figure 5: Single-step reward increase over different phases of training. Note that the maximum number of steps is 50 million for PPO and 20 million for SAC.
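
Figures 3 and 4 concern the PD controller gains $k_p$ and $k_d$ and how closely the simulated joint follows the reference position and velocity. As a rough illustration of what those gains do, here is a minimal sketch of a textbook PD control law applied to a single joint. It is not the paper's implementation: the function names, the unit-inertia joint model, the time step, and the gain values are all assumptions chosen for the example, and the gains are unrelated to the tuned values reported in the paper.

```python
def pd_torque(q_ref, q, qd_ref, qd, k_p, k_d):
    """Textbook PD law: torque from position and velocity tracking errors."""
    return k_p * (q_ref - q) + k_d * (qd_ref - qd)


def simulate_joint(q_ref, k_p, k_d, steps=2000, dt=0.001, inertia=1.0):
    """Drive a hypothetical single joint toward a constant target angle.

    The joint is modelled as a pure inertia (assumption) and integrated
    with semi-implicit Euler; returns the final joint angle in radians.
    """
    q, qd = 0.0, 0.0  # start at rest at angle zero
    for _ in range(steps):
        # Target velocity is zero because the target angle is constant.
        tau = pd_torque(q_ref, q, 0.0, qd, k_p, k_d)
        qd += (tau / inertia) * dt  # update velocity from acceleration
        q += qd * dt                # update position from new velocity
    return q


if __name__ == "__main__":
    # Illustrative gains only (critically damped for unit inertia).
    final_angle = simulate_joint(q_ref=0.5, k_p=100.0, k_d=20.0)
    print(f"final angle: {final_angle:.4f} rad")
```

In this sketch, a larger `k_p` pulls the joint toward the target faster, while `k_d` damps the motion. Searching over the two gains for the lowest tracking error is the kind of tuning the TL;DR attributes to Bayesian optimization.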