RVLF: A Reinforcing Vision-Language Framework for Gloss-Free Sign Language Translation

Zhi Rao, Yucheng Zhou, Benjia Zhou, Yiqing Huang, Sergio Escalera, Jun Wan

TL;DR

RVLF is a three-stage framework for gloss-free sign language translation that combines cross-lingual vision-language pre-training, instruction tuning of a large language model, and GRPO-based sentence-level reinforcement. It fuses skeleton-based motion cues with dense DINOv2 visual features to obtain richer sign representations, and uses a GRPO reward that combines BLEU-4 and ROUGE-L to align translations with reference semantics. Across CSL-Daily, PHOENIX-2014T, How2Sign, and OpenASL, RVLF achieves state-of-the-art results without pre-training on external large-scale sign language data, narrowing the gap to gloss-based methods. The work shows that both expressive sign representations and sentence-level optimization matter for robust gloss-free SLT across languages and domains.

Abstract

Gloss-free sign language translation (SLT) is hindered by two key challenges: **inadequate sign representation** that fails to capture nuanced visual cues, and **sentence-level semantic misalignment** in current LLM-based methods, which limits translation quality. To address these issues, we propose a three-stage **r**einforcing **v**ision-**l**anguage **f**ramework (**RVLF**). We build a large vision-language model (LVLM) specifically designed for sign language, and then combine it with reinforcement learning (RL) to adaptively enhance translation performance. First, to obtain a sufficient representation of sign language, RVLF introduces an effective semantic representation learning mechanism that fuses skeleton-based motion cues with semantically rich visual features extracted via DINOv2, followed by instruction tuning to obtain a strong SLT-SFT baseline. Then, to mitigate sentence-level semantic misalignment, we introduce a GRPO-based optimization strategy that fine-tunes the SLT-SFT model with a reward function combining translation fidelity (BLEU) and sentence completeness (ROUGE), yielding the optimized model termed SLT-GRPO. Our conceptually simple framework yields substantial gains under the gloss-free SLT setting without pre-training on any external large-scale sign language datasets, improving BLEU-4 scores by +5.1, +1.11, +1.4, and +1.61 on the CSL-Daily, PHOENIX-2014T, How2Sign, and OpenASL datasets, respectively. To the best of our knowledge, this is the first work to incorporate GRPO into SLT. Extensive experiments and ablation studies validate the effectiveness of GRPO-based optimization in enhancing both translation quality and semantic consistency.
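
The sentence-level reward and the group-relative normalization that GRPO applies to it can be illustrated with a minimal sketch. It is not the authors' implementation: the equal weights `w_bleu` and `w_rouge`, the add-one smoothing in `sentence_bleu4`, the F1 form of `rouge_l`, whitespace tokenization, and all function names are assumptions made for illustration. Only the idea of a BLEU plus ROUGE reward scored per sampled translation comes from the text above.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu4(hyp, ref):
    """Sentence-level BLEU-4 with add-one smoothing (an assumed variant)."""
    if not hyp:
        return 0.0
    log_prec = 0.0
    for n in range(1, 5):
        hyp_ngrams = ngram_counts(hyp, n)
        ref_ngrams = ngram_counts(ref, n)
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = sum(hyp_ngrams.values())
        log_prec += 0.25 * math.log((overlap + 1) / (total + 1))
    # Brevity penalty for hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_prec)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(hyp, ref):
    """ROUGE-L as the F1 of LCS-based precision and recall (assumed beta = 1)."""
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(hyp), lcs / len(ref)
    return 2 * p * r / (p + r)


def reward(hyp, ref, w_bleu=0.5, w_rouge=0.5):
    """Weighted sum of BLEU-4 and ROUGE-L; the weights are hypothetical."""
    return w_bleu * sentence_bleu4(hyp, ref) + w_rouge * rouge_l(hyp, ref)


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# One reference and a group of three sampled translations for the same video.
ref = "the weather will be sunny tomorrow".split()
group = [
    "the weather will be sunny tomorrow".split(),
    "the weather is sunny".split(),
    "rain tonight".split(),
]
rewards = [reward(h, ref) for h in group]
advantages = group_advantages(rewards)
```

Because the advantages are standardized within the group, a translation is pushed up only when it scores better than its sibling samples for the same input, which removes the need for a separate learned value model.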

Paper Structure

This paper contains 43 sections, 15 equations, 7 figures, 15 tables.

Figures (7)

  • Figure 1: The overview illustrates the difference between previous gloss-free SLT frameworks (a) and the proposed RVLF framework (b), along with the performance of our method on the CSL-Daily dataset (c). REF: Reinforcement Fine-Tuning.
  • Figure 2: The framework of RVLF. Stage 1 (Pre-training): A vision-language foundation is built using contrastive ($\mathcal{L}_{con}$) and translation ($\mathcal{L}_{slt}$) losses. Stage 2 (Supervised Fine-Tuning): Adequate sign language representations are used as visual inputs, while appropriate instruction tuning is applied to the large language model to construct a vision–language model specifically tailored for sign language understanding and translation. Stage 3 (Reinforcement Fine-Tuning): A GRPO-based optimization strategy is employed to further fine-tune the policy model from the previous stage using a reward function based on sentence-level translation metrics, improving the overall translation performance.
  • Figure 3: Learning curves of SLT-GRPO on the CSL-Daily dataset. The baseline corresponds to the SLT-SFT model. The performance shown in the SLT-GRPO curve reflects results on the validation set of CSL-Daily.
  • Figure 4: Visualization of the input length distribution of sign language videos in the training sets of four datasets: PHOENIX-2014T, CSL-Daily, How2Sign, and OpenASL. Because the frame count distributions of How2Sign and OpenASL are strongly long-tailed, only samples with 1000 or fewer frames are shown to improve readability; this still covers the vast majority of samples in both datasets.
  • Figure 5: Stacked histogram of frame-to-frame similarity for full body, face, and hand regions.
  • ...and 2 more figures
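
The Stage 1 objective named in the Figure 2 caption pairs a contrastive loss $\mathcal{L}_{con}$ with a translation loss $\mathcal{L}_{slt}$. The sketch below shows one common way such an objective is built, under explicit assumptions: a symmetric, CLIP-style InfoNCE loss over cosine similarities, a temperature of 0.07, and a weighted sum with a hypothetical coefficient `lam`. The caption does not state the exact form of either loss or how they are combined, so every function name and parameter here is illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        loss += log_z - row[i]
    return loss / len(logits)


def contrastive_loss(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video and text features.

    Assumed form: pair i is the only positive for row i and column i.
    """
    sims = [[cosine(v, t) / temperature for t in text_feats] for v in video_feats]
    sims_t = [list(col) for col in zip(*sims)]  # text-to-video direction
    return 0.5 * (cross_entropy_diag(sims) + cross_entropy_diag(sims_t))


def stage1_loss(l_con, l_slt, lam=1.0):
    """Hypothetical combination: L_con + lam * L_slt."""
    return l_con + lam * l_slt


# Toy batch of two video/text pairs with 2-dimensional features.
video = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9]]
text_shuffled = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = contrastive_loss(video, text_aligned)
loss_shuffled = contrastive_loss(video, text_shuffled)
```

In this toy batch the aligned pairing yields a lower loss than the shuffled one, which is the behavior a contrastive pre-training term is meant to reward.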