SignBind-LLM: Multi-Stage Modality Fusion for Sign Language Translation
Marshall Thomas, Edward Fish, Richard Bowden
TL;DR
SignBind-LLM introduces a modular, four-stage framework for gloss-free Sign Language Translation that models continuous signing, fingerspelling, and lipreading separately, fuses them with a temporally aware transformer, and decodes the result with a large language model. It explicitly addresses temporal misalignment between modalities and improves recognition of proper nouns and technical terms, achieving state-of-the-art results on How2Sign and BOBSL (BLEU-4) and on ChicagoFSWildPlus (letter accuracy). Training follows a staged curriculum: the modality experts are pretrained independently, then the fusion module is trained, then the LLM is fine-tuned, which enables robust generalization across ASL and BSL datasets. Qualitative and ablation results demonstrate the pivotal role of lipreading and the fusion encoder, and show that the LLM refinement step effectively converts fused pseudo-gloss streams into fluent English.
Abstract
Despite progress in gloss-free Sign Language Translation (SLT), traditional single-modality end-to-end approaches consistently fail on two critical components of natural signing: the precise recognition of high-speed fingerspelling and the integration of asynchronous non-manual cues from the face. Recent progress in SLT with Large Language Models (LLMs) has sidestepped this challenge by forcing a single network to learn both simultaneously, resulting in poor performance when translating crucial information such as names, places, and technical terms. We introduce SignBind-LLM, a modular framework designed to overcome these limitations. Our approach employs separate, specialized predictors for continuous signing, fingerspelling, and lipreading. Each expert network first decodes its specific modality into a sequence of tokens. These parallel streams are then fused by a lightweight transformer that resolves temporal misalignments before passing the combined representation to an LLM for final sentence generation. Our method establishes a new state of the art on the How2Sign, ChicagoFSWildPlus, and BOBSL datasets, with a BLEU-4 score of 22.1, a letter accuracy of 73.2%, and a BLEU-4 score of 6.8, respectively. These results validate our core hypothesis: isolating and solving distinct recognition tasks before fusion provides a more powerful and effective pathway to robust, high-fidelity sign language translation.
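To make the data flow in the abstract concrete, the following is a minimal sketch of the expert-then-fuse-then-LLM pipeline. It is not the authors' implementation. The paper's fusion stage is a learned lightweight transformer; here it is replaced by a naive sort on timestamps purely for illustration. The `Token` type, the modality tags, the example tokens, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    """One token emitted by a modality expert (hypothetical structure)."""
    text: str
    start: float   # start time in seconds
    end: float     # end time in seconds
    modality: str  # "sign", "fingerspell", or "lip"


def merge_streams(*streams: List[Token]) -> List[Token]:
    """Interleave expert outputs into one stream ordered by start time.

    Stand-in for the learned fusion transformer: it only orders tokens
    and does not resolve conflicts between modalities.
    """
    merged = [tok for stream in streams for tok in stream]
    return sorted(merged, key=lambda t: (t.start, t.end))


def build_prompt(tokens: List[Token]) -> str:
    """Format the fused pseudo-gloss stream as text for an LLM (assumed format)."""
    tagged = " ".join(f"<{t.modality}>{t.text}" for t in tokens)
    return f"Translate the following pseudo-gloss stream into fluent English:\n{tagged}"


# Hypothetical expert outputs; note the lip stream starts before the
# fingerspelling stream, an example of asynchronous modalities.
sign = [Token("MY", 0.0, 0.4, "sign"), Token("NAME", 0.4, 0.9, "sign")]
fingerspell = [Token("A-N-N-A", 1.0, 2.1, "fingerspell")]
lip = [Token("anna", 0.9, 2.0, "lip")]

fused = merge_streams(sign, fingerspell, lip)
prompt = build_prompt(fused)
print(prompt)
```

In this sketch the overlapping lip and fingerspelling tokens end up adjacent in the fused stream, so a downstream language model sees both pieces of evidence for the same proper noun side by side. A learned fusion module, as described in the abstract, would additionally weigh and align the streams rather than simply ordering them.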
