SignBind-LLM: Multi-Stage Modality Fusion for Sign Language Translation

Marshall Thomas, Edward Fish, Richard Bowden

TL;DR

SignBind-LLM introduces a modular, four-stage framework for gloss-free Sign Language Translation: it models continuous signing, fingerspelling, and lipreading separately, fuses them with a temporally aware transformer, and decodes the result with a large language model. It explicitly addresses temporal misalignment between modalities and improves recognition of proper nouns and technical terms, achieving state-of-the-art BLEU-4 and letter accuracy on How2Sign, ChicagoFSWildPlus, and BOBSL. Training follows a staged curriculum (independent pretraining of the modality experts, then fusion, then LLM fine-tuning), which enables robust generalization across ASL and BSL datasets. Qualitative and ablation results demonstrate the pivotal role of lipreading and the fusion encoder, and show that the LLM refinement step effectively converts fused pseudo-gloss streams into fluent English.

Abstract

Despite progress in gloss-free Sign Language Translation (SLT), traditional single-modality end-to-end approaches consistently fail on two critical components of natural signing: the precise recognition of high-speed fingerspelling and the integration of asynchronous non-manual cues from the face. Recent progress in SLT with Large Language Models has sidestepped this challenge, forcing a single network to learn both simultaneously, which results in poor performance when translating crucial information such as names, places, and technical terms. We introduce SignBind-LLM, a modular framework designed to overcome these limitations. Our approach employs separate, specialized predictors for continuous signing, fingerspelling, and lipreading. Each expert network first decodes its specific modality into a sequence of tokens. These parallel streams are then fused by a lightweight transformer that resolves temporal misalignments before passing the combined representation to a Large Language Model (LLM) for final sentence generation. Our method establishes a new state of the art on the How2Sign, ChicagoFSWildPlus, and BOBSL datasets, with a BLEU-4 score of 22.1, a letter accuracy of 73.2%, and a BLEU-4 score of 6.8, respectively. These results validate our core hypothesis: isolating and solving distinct recognition tasks before fusion provides a more powerful and effective pathway to robust, high-fidelity sign language translation.
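
For readers unfamiliar with the fingerspelling metric quoted above, the following is a minimal sketch of letter accuracy. It assumes the edit-distance-based definition commonly used for fingerspelling benchmarks (one minus total character edits divided by total reference letters); the paper's exact evaluation script is not shown here, and the function names and example words are illustrative only.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: substitutions, insertions, and deletions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # cost of deleting the first i reference letters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference letter
                            curr[j - 1] + 1,      # insert a hypothesis letter
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def letter_accuracy(refs: list[str], hyps: list[str]) -> float:
    """Corpus-level letter accuracy: 1 - total_edits / total_reference_letters.

    Assumption: edits are pooled over the whole corpus rather than averaged
    per sequence.
    """
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_letters = sum(len(r) for r in refs)
    return 1.0 - total_edits / total_letters


# Illustrative (made-up) fingerspelled words and predictions.
refs = ["chicago", "sign"]
hyps = ["chicgo", "sign"]
print(f"letter accuracy: {letter_accuracy(refs, hyps):.3f}")
```

Under this definition a single dropped letter in an 11-letter corpus costs about nine points of accuracy, and the score can become negative when a prediction contains many spurious insertions.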

Paper Structure

This paper contains 49 sections, 18 equations, 4 figures, 18 tables.

Figures (4)

  • Figure 1: Our method consists of four stages. First, we perform text pre-processing, where pseudo-glosses are generated using an LLM and then phonemized for lipreading. In parallel, we extract both full-frame sign sequences and cropped face regions. Second, we pre-train the model's specialized predictors: continuous signing, using the extracted pseudo-glosses; fingerspelling, using sequences of English letters; and lipreading, using the extracted phonemes. A sequence classifier determines whether a given segment corresponds to signing, fingerspelling, or resting. Third, these representations are gated and fused within a transformer-based Fusion Encoder, which is also trained on the pseudo-glosses, aligning complementary cues across modalities. Finally, the fused pseudo-gloss sequence is passed to a fine-tuned LLM to generate coherent spoken-language sentences.
  • Figure 2: An example of the translation process from How2Sign. The sign model makes a number of errors in its predictions; the fusion process resolves these into accurate pseudo-glosses using the phonemes from lipreading, and the LLM then uses the pseudo-glosses to generate a coherent spoken English sentence.
  • Figure 3: Part-of-Speech Accuracy on How2Sign. We compare per-part-of-speech accuracy between our approach and two SOTA methods, C$^2$RL (Chen et al., 2024) and Geo-Sign (Fish and Bowden, 2025). Our method performs especially well on nouns, owing to improved fingerspelling recognition. The high performance on interjections is likely an artifact of the LLM learning that most instructional videos begin with "So!" and "OK!".
  • Figure 4: Label Token Length Reduction via Pseudo-Glossing. Histograms illustrating the per-sentence reduction in token count after converting natural English to pseudo-gloss representations. The dashed red line indicates the average number of words removed. This compression significantly reduces the CTC alignment search space, enabling more stable training and better generalization.
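
The per-sentence statistic plotted in Figure 4 can be sketched as follows. This is an illustration only: the sentence and pseudo-gloss pairs are hand-written stand-ins (the paper generates pseudo-glosses with an LLM), whitespace tokenization is an assumption, and `token_reduction` is a hypothetical helper name.

```python
from statistics import mean


def token_reduction(sentence: str, pseudo_gloss: str) -> int:
    """Number of whitespace-separated tokens removed by pseudo-glossing."""
    return len(sentence.split()) - len(pseudo_gloss.split())


# Hypothetical (English sentence, pseudo-gloss) pairs, written by hand.
pairs = [
    ("So today I am going to show you how to tie a knot",
     "TODAY I SHOW HOW TIE KNOT"),
    ("OK now we will add the flour to the bowl",
     "NOW ADD FLOUR BOWL"),
]

# One value per sentence: these are the values a histogram would bin.
reductions = [token_reduction(sentence, gloss) for sentence, gloss in pairs]

# The mean corresponds to the dashed line described in the caption.
average_removed = mean(reductions)
print(reductions, average_removed)
```

Shorter label sequences matter for CTC because the loss sums over all monotonic alignments of the labels to the input frames; with fewer labels per clip there are fewer alignments to consider.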