A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars

Ronglai Zuo, Fangyun Wei, Zenggui Chen, Brian Mak, Jiaolong Yang, Xin Tong

TL;DR

This work tackles Spoken2Sign translation with a practical three-stage baseline that translates spoken language into sign language and renders the result with a 3D avatar. It builds a gloss-video dictionary from Sign2Spoken benchmarks, estimates a 3D sign for each dictionary entry with the SMPLSign-X model, and uses a Text2Gloss translator (mBART) together with a sign connector to stitch the retrieved 3D signs into coherent translations. The system achieves state-of-the-art back-translation BLEU scores on Phoenix-2014T and CSL-Daily. It also yields two by-products, 3D keypoint augmentation and multi-view understanding, that improve keypoint-based sign language understanding. By releasing gloss dictionaries and demonstrating 3D sign rendering, the work offers a functional, view-independent Spoken2Sign pipeline with practical value for deaf-hearing communication.

Abstract

The objective of this paper is to develop a functional system for translating spoken languages into sign languages, referred to as Spoken2Sign translation. The Spoken2Sign task is orthogonal and complementary to traditional sign language to spoken language (Sign2Spoken) translation. To enable Spoken2Sign translation, we present a simple baseline consisting of three steps: 1) creating a gloss-video dictionary using existing Sign2Spoken benchmarks; 2) estimating a 3D sign for each sign video in the dictionary; 3) training a Spoken2Sign model, which is composed of a Text2Gloss translator, a sign connector, and a rendering module, with the aid of the resulting gloss-3D sign dictionary. The translation results are then displayed through a sign avatar. As far as we know, we are the first to present the Spoken2Sign task in an output format of 3D signs. In addition to its capability of Spoken2Sign translation, we also demonstrate that two by-products of our approach, 3D keypoint augmentation and multi-view understanding, can assist in keypoint-based sign language understanding. Code and models are available at https://github.com/FangyunWei/SLRT.
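
The three steps in the abstract can be pictured with a toy sketch. Everything here is an illustrative assumption rather than the authors' code: the function names, the `(gloss, clip)` input format, the word-lookup stand-in for the learned Text2Gloss translator, and the choice of the first dictionary entry per gloss are all hypothetical, and the strings merely stand in for estimated 3D signs.

```python
from collections import defaultdict


def build_gloss_dictionary(segments):
    """Step 1 (sketch): group isolated sign clips by their gloss label.

    `segments` is a hypothetical list of (gloss, clip) pairs; the paper's
    actual dictionary construction may differ.
    """
    dictionary = defaultdict(list)
    for gloss, clip in segments:
        dictionary[gloss].append(clip)
    return dict(dictionary)


def text_to_gloss(text, lexicon):
    """Toy stand-in for the learned Text2Gloss translator (plain word lookup)."""
    return [lexicon[word] for word in text.lower().split() if word in lexicon]


def retrieve_signs(glosses, gloss_to_3d):
    """Step 3 (sketch): look up one 3D sign per gloss, skipping unknown glosses.

    Taking the first dictionary entry is an assumption made for brevity.
    """
    return [gloss_to_3d[g][0] for g in glosses if g in gloss_to_3d]


# Toy data: each string stands in for an estimated 3D sign (step 2).
segments = [
    ("WETTER", "sign3d_wetter_a"),
    ("MORGEN", "sign3d_morgen_a"),
    ("WETTER", "sign3d_wetter_b"),
]
gloss_to_3d = build_gloss_dictionary(segments)
lexicon = {"weather": "WETTER", "tomorrow": "MORGEN"}

glosses = text_to_gloss("Weather tomorrow", lexicon)
signs = retrieve_signs(glosses, gloss_to_3d)
print(glosses, signs)
```

In the described system the retrieved signs would then be joined by the sign connector and passed to the rendering module; this sketch stops at retrieval.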

Paper Structure

This paper contains 15 sections, 8 equations, 11 figures, and 8 tables.

Figures (11)

  • Figure 1: Prior works have presented Spoken2Sign translation results through either (a) keypoint sequences [saunders2021continuous] or (b) 2D videos [saunders2022signing]. In contrast, we use (c) a 3D avatar to display the translation results, enabling visualization from any viewpoint.
  • Figure 2: Overview of our methodology. It consists of (a) dictionary construction; (b) 3D sign estimation; (c) Spoken2Sign translation. Sign videos are from Phoenix-2014T [2014T], a German sign language benchmark.
  • Figure 3: Illustration of the sign connector. The objective is to predict the duration of the co-articulation between two adjacent 3D signs, $S_{k-1}$ and $S_k$, followed by generating the co-articulation through interpolation in the 3D joint space.
  • Figure 4: Qualitative results on Phoenix-2014T [2014T] (a and b) and CSL-Daily [zhou2021improving] (c). In each sub-figure, the caption gives the input text, the first row shows the ground-truth sign video, and the second row shows our translation result.
  • Figure 5: Qualitative results on Phoenix-2014T [2014T]. In each sub-figure, the caption gives the input text, the first row shows the ground-truth sign video, and the second row shows our translation result. The German text is translated into English.
  • ...and 6 more figures
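
The sign connector described in the Figure 3 caption predicts a co-articulation duration and then fills the gap by interpolating in 3D joint space. The following is a minimal sketch of the interpolation half only, assuming plain linear interpolation between the last frame of $S_{k-1}$ and the first frame of $S_k$. The duration is passed in as a given number of frames rather than predicted, and the function names and the frame layout (a list of `(x, y, z)` joints per frame) are hypothetical.

```python
def interpolate_coarticulation(prev_sign, next_sign, num_frames):
    """Create `num_frames` transition frames between two adjacent 3D signs.

    Each sign is a list of frames; each frame is a list of (x, y, z) joints.
    Linear interpolation is an assumption made for this sketch.
    """
    start, end = prev_sign[-1], next_sign[0]
    transition = []
    for i in range(1, num_frames + 1):
        t = i / (num_frames + 1)  # strictly between 0 and 1
        frame = [
            tuple(a + t * (b - a) for a, b in zip(joint_start, joint_end))
            for joint_start, joint_end in zip(start, end)
        ]
        transition.append(frame)
    return transition


def connect_signs(signs, durations):
    """Concatenate signs, inserting an interpolated transition between neighbours.

    `durations[k]` is the transition length between signs k and k+1; in the
    described system this value would come from the learned connector.
    """
    sequence = list(signs[0])
    for prev_sign, next_sign, n in zip(signs, signs[1:], durations):
        sequence.extend(interpolate_coarticulation(prev_sign, next_sign, n))
        sequence.extend(next_sign)
    return sequence


# Two one-frame, one-joint signs, joined by a three-frame transition.
sign_a = [[(0.0, 0.0, 0.0)]]
sign_b = [[(4.0, 0.0, 0.0)]]
transition = interpolate_coarticulation(sign_a, sign_b, 3)
sequence = connect_signs([sign_a, sign_b], [3])
print(len(sequence), transition)
```

Passing the duration as an argument keeps the interpolation step independent of however the duration is predicted.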