LLMs are Good Sign Language Translators

Jia Gong, Lin Geng Foo, Yixuan He, Hossein Rahmani, Jun Liu

TL;DR

This work addresses the challenge of translating sign language videos to spoken language by leveraging off-the-shelf, frozen large language models (LLMs). It introduces SignLLM, a framework that regularizes sign videos into language-like representations through two modules: Vector-Quantized Visual Sign (VQ-Sign), which converts sign videos into discrete character-level tokens, and Codebook Reconstruction and Alignment (CRA), which builds word-level tokens via an optimal-transport formulation and aligns them with text via a sign-text alignment loss. By feeding the resulting language-like sign sentences into a frozen LLM with a prompt, SignLLM achieves state-of-the-art gloss-free SLT results on Phoenix-2014T and CSL-Daily, without fine-tuning the LLM. The approach demonstrates the viability of harnessing LLMs for SLT through carefully designed tokenization and alignment strategies, offering a data-efficient pathway to cross-modal translation and potential applicability across languages and datasets.
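
The quantization step in VQ-Sign can be pictured as a nearest-neighbour lookup into a codebook. The following is a minimal sketch of that generic vector-quantization idea, not the authors' implementation: the function names, the squared Euclidean distance, and the toy 2-D features and codebook are all assumptions made for illustration.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: List[Sequence[float]],
             codebook: List[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry.

    The returned indices play the role of discrete character-level tokens.
    """
    tokens = []
    for feature in features:
        distances = [squared_distance(feature, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


# Toy codebook with three entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Toy per-clip features, standing in for the visual encoder output.
features = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [1.1, 0.1]]

tokens = quantize(features, codebook)
print(tokens)  # [0, 1, 2, 1]
```

Each continuous feature is replaced by a codebook index, so a video becomes a sequence of discrete symbols that can later be grouped into larger units.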

Abstract

Sign Language Translation (SLT) is a challenging task that aims to translate sign videos into spoken language. Inspired by the strong translation capabilities of large language models (LLMs) that are trained on extensive multilingual text corpora, we aim to harness off-the-shelf LLMs to handle SLT. In this paper, we regularize the sign videos to embody linguistic characteristics of spoken language, and propose a novel SignLLM framework that transforms sign videos into a language-like representation that off-the-shelf LLMs can read more easily. SignLLM comprises two key modules: (1) the Vector-Quantized Visual Sign module converts sign videos into a sequence of discrete character-level sign tokens, and (2) the Codebook Reconstruction and Alignment module converts these character-level tokens into word-level sign representations using an optimal transport formulation. A sign-text alignment loss further bridges the gap between sign and text tokens, enhancing semantic compatibility. We achieve state-of-the-art gloss-free results on two widely used SLT benchmarks.
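
To make the character-to-word conversion concrete, here is a small sketch of how a sequence of character-level tokens might be regrouped once a word-level codebook is available. It is illustrative only. The greedy longest-match strategy, the `merge_characters` helper, the `max_len` parameter, and the hand-written codebook are assumptions; the paper derives its word-level codebook through an optimal transport formulation, which is not reproduced here.

```python
from typing import Dict, List, Tuple


def merge_characters(chars: List[int],
                     word_codebook: Dict[Tuple[int, ...], str],
                     max_len: int = 3) -> List[str]:
    """Replace runs of character tokens with word-level tokens.

    At each position the longest run (up to max_len) that appears in the
    word codebook is replaced by its word token. Characters with no match
    are kept as single-character tokens.
    """
    words: List[str] = []
    i = 0
    while i < len(chars):
        for n in range(min(max_len, len(chars) - i), 0, -1):
            candidate = tuple(chars[i:i + n])
            if candidate in word_codebook:
                words.append(word_codebook[candidate])
                i += n
                break
        else:
            # No word-level entry starts here: keep the character as-is.
            words.append(f"s{chars[i]}")
            i += 1
    return words


# Hypothetical word-level codebook: character sequences -> word tokens.
word_codebook = {(2, 3, 4): "s2s3s4", (1, 5): "s1s5"}
# Character-level token indices, e.g. produced by a quantizer.
chars = [1, 5, 2, 3, 4, 7]

sentence = merge_characters(chars, word_codebook)
print(sentence)  # ['s1s5', 's2s3s4', 's7']
```

The output is shorter than the input and made of word-like units, which is the property that makes the sequence resemble a sentence.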

Paper Structure

This paper contains 13 sections, 6 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: An overview of our SignLLM framework. During inference (top): Given an input sign video $X$, we first pass it through our VQ-Sign module to obtain a sequence of discrete character-level sign tokens $\hat{Z}$. VQ-Sign consists of a visual encoder $E_v$, which extracts compact features, and a character-level sign codebook $\mathbb{S}^c$, which quantizes those features to obtain $\hat{Z}$. Next, we feed $\hat{Z}$ into our CRA module, which reorganizes $\hat{Z}$ by replacing short sequences of character-level tokens with word-level tokens from the word-level codebook, e.g., the character sequence $[s_2,s_3,s_4]$ becomes the word $s_2s_3s_4$. This transforms the sign video into a language-like sign sentence $W$, which is fed into the LLM along with a text prompt that guides the LLM to generate translations in the desired language. During training (bottom): We optimize VQ-Sign and its discrete sign codebook via a context prediction task, which seeks to recognize the future time steps based on the current context information. Next, for our CRA module, we construct the optimal word-level codebook by considering two aspects, entropy and size, which we address using optimal transport techniques. Then, we narrow the gap between the sign token space and the LLM's text token space by minimizing a Maximum Mean Discrepancy (MMD) loss, which improves the semantic compatibility between them.
  • Figure 2: Visualization of translation results. Correct translations are shown in blue and incorrect ones in red.
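
The Figure 1 caption mentions narrowing the gap between the sign token space and the text token space with an MMD loss. As a rough illustration of what such a quantity measures, the sketch below computes a biased estimate of the squared Maximum Mean Discrepancy between two small sets of embeddings. The Gaussian (RBF) kernel, the `sigma` bandwidth, and the toy embeddings are assumptions chosen for the example; the paper's exact kernel and training setup are not shown here.

```python
import math
from typing import List, Sequence


def rbf_kernel(a: Sequence[float], b: Sequence[float], sigma: float) -> float:
    """Gaussian kernel between two vectors (assumed kernel choice)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs: List[Sequence[float]], ys: List[Sequence[float]],
                sigma: float) -> float:
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs: List[Sequence[float]], ys: List[Sequence[float]],
                sigma: float = 1.0) -> float:
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (mean_kernel(xs, xs, sigma)
            + mean_kernel(ys, ys, sigma)
            - 2.0 * mean_kernel(xs, ys, sigma))


# Toy embeddings (hypothetical values, not taken from the paper).
sign_tokens = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
text_tokens_aligned = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
text_tokens_shifted = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]

mmd_aligned = mmd_squared(sign_tokens, text_tokens_aligned)
mmd_shifted = mmd_squared(sign_tokens, text_tokens_shifted)
print(mmd_aligned, mmd_shifted)
```

The estimate is zero when the two sets coincide and grows as one set is shifted away from the other, so minimizing it pulls the two distributions together.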