LLMs are Good Sign Language Translators
Jia Gong, Lin Geng Foo, Yixuan He, Hossein Rahmani, Jun Liu
TL;DR
This work addresses sign language translation (SLT), the task of translating sign language videos into spoken language, by leveraging off-the-shelf, frozen large language models (LLMs). It introduces SignLLM, a framework that regularizes sign videos into language-like representations through two modules: Vector-Quantized Visual Sign (VQ-Sign), which converts sign videos into discrete character-level tokens, and Codebook Reconstruction and Alignment (CRA), which composes those character-level tokens into word-level tokens via an optimal-transport formulation and aligns them with text through a sign-text alignment loss. By feeding the resulting language-like sign sentences into a frozen LLM with a prompt, SignLLM achieves state-of-the-art gloss-free SLT results on Phoenix-2014T and CSL-Daily without fine-tuning the LLM. The results suggest that carefully designed tokenization and alignment are enough to make frozen LLMs usable for SLT, offering a potentially data-efficient route to cross-modal translation that may carry over to other languages and datasets.
Abstract
Sign Language Translation (SLT) is a challenging task that aims to translate sign videos into spoken language. Inspired by the strong translation capabilities of large language models (LLMs) that are trained on extensive multilingual text corpora, we aim to harness off-the-shelf LLMs to handle SLT. In this paper, we regularize the sign videos to embody linguistic characteristics of spoken language, and propose a novel SignLLM framework to transform sign videos into a language-like representation for improved readability by off-the-shelf LLMs. SignLLM comprises two key modules: (1) The Vector-Quantized Visual Sign module converts sign videos into a sequence of discrete character-level sign tokens, and (2) the Codebook Reconstruction and Alignment module converts these character-level tokens into word-level sign representations using an optimal transport formulation. A sign-text alignment loss further bridges the gap between sign and text tokens, enhancing semantic compatibility. We achieve state-of-the-art gloss-free results on two widely-used SLT benchmarks.
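To make the idea of discrete character-level sign tokens concrete, the following is a minimal sketch of generic vector quantization: each per-clip feature vector is replaced by the index of its nearest codebook entry. It is not the authors' VQ-Sign implementation. The function name `quantize`, the toy codebook, and the 2-D features are all assumptions chosen for illustration; a real system would learn the codebook and extract features with a visual encoder.

```python
from typing import Tuple

import numpy as np


def quantize(features: np.ndarray, codebook: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Map each feature vector to its nearest codebook entry (hypothetical helper).

    features: (T, D) array, one D-dimensional vector per short video clip.
    codebook: (K, D) array of K code vectors.
    Returns the (T,) token indices and the (T, D) quantized vectors.
    """
    # Squared Euclidean distance between every feature and every code: shape (T, K).
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # The discrete token for each clip is the index of the closest code.
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]


# Toy data (assumed, not from the paper): 3 codes and 4 clip features in 2-D.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
features = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 1.9], [0.2, 0.1]])

tokens, quantized = quantize(features, codebook)
print(tokens.tolist())  # [0, 1, 2, 0]
```

In this toy run the four clips become the token sequence 0, 1, 2, 0, which is the kind of discrete sequence that a later stage could group into word-level units.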
