CLIP-SLA: Parameter-Efficient CLIP Adaptation for Continuous Sign Language Recognition
Sarah Alyami, Hamzah Luqman
TL;DR
This work tackles the data- and compute-intensive challenge of continuous sign language recognition (CSLR) by repurposing large vision-language models (VLMs). It introduces CLIP-SLA, a parameter-efficient framework that keeps CLIP's visual backbone frozen and comes in two PEFT-based variants, SLA-LoRA and SLA-Adapter, which embed temporal modeling via TSM or 3DConv adapters and are paired with a CSLR sequence module trained with CTC and VAC losses. The methods achieve competitive or state-of-the-art performance on four CSLR benchmarks (Phoenix2014, Phoenix2014-T, CSL-Daily, Isharah-500) with far fewer trainable parameters, and extensive ablations validate the effectiveness of temporal integration and adapter design. The results demonstrate the viability of adapting large pre-trained VLMs for efficient CSLR and point to future work on other PEFT methods and a wider range of VLMs for sign language understanding.
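To make the temporal-modeling idea concrete, here is a minimal sketch of a temporal shift over per-frame features, the operation behind a temporal shift module (TSM). It is an illustration under stated assumptions, not the authors' implementation: the function name `temporal_shift`, the `[T][C]` list layout, and the `fold_div` value are hypothetical choices made for this example.

```python
def temporal_shift(frames, fold_div=8):
    """Shift a fraction of channels along the time axis, padding with zeros.

    frames: list of T per-frame feature vectors, each a list of C numbers.
    fold_div: assumed hyperparameter; C // fold_div channels move in each direction.
    """
    num_frames = len(frames)
    channels = len(frames[0])
    fold = channels // fold_div
    # Copy the input so channels outside the two shifted groups stay unchanged.
    shifted = [list(frame) for frame in frames]
    for t in range(num_frames):
        for c in range(fold):
            # First group: read the value from the next frame.
            shifted[t][c] = frames[t + 1][c] if t + 1 < num_frames else 0.0
        for c in range(fold, 2 * fold):
            # Second group: read the value from the previous frame.
            shifted[t][c] = frames[t - 1][c] if t >= 1 else 0.0
    return shifted


# Three frames with four channels each; fold_div=4 shifts one channel each way.
example = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
mixed = temporal_shift(example, fold_div=4)
```

The shift itself has no trainable weights, so it lets a frame-level encoder exchange information between neighboring frames without increasing the parameter count.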
Abstract
Continuous sign language recognition (CSLR) focuses on interpreting and transcribing sequences of sign language gestures in videos. In this work, we propose CLIP sign language adaptation (CLIP-SLA), a novel CSLR framework that adapts the powerful pre-trained visual encoder of the CLIP model to sign language tasks through parameter-efficient fine-tuning (PEFT). We introduce two variants, SLA-Adapter and SLA-LoRA, which integrate PEFT modules into the CLIP visual encoder, enabling fine-tuning with minimal trainable parameters. The effectiveness of the proposed frameworks is validated on four datasets: Phoenix2014, Phoenix2014-T, CSL-Daily, and Isharah-500, where both CLIP-SLA variants outperformed several state-of-the-art (SOTA) models with fewer trainable parameters. Extensive ablation studies emphasize the effectiveness and flexibility of the proposed methods with different vision-language models for CSLR. These findings showcase the potential of adapting large-scale pre-trained models for scalable and efficient CSLR, paving the way for future advancements in sign language understanding.
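As a rough illustration of why LoRA-style fine-tuning needs few trainable parameters, the sketch below counts the parameters in the two low-rank factors that LoRA trains in place of a full weight matrix. The layer width and rank are hypothetical values picked for the arithmetic, not the configuration used in CLIP-SLA, and the helper names are invented for this example.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def full_weight_params(d_out, d_in):
    """Parameters in the frozen d_out x d_in weight matrix itself."""
    return d_out * d_in


# Hypothetical example: one square projection of width 768 with rank 8.
width, rank = 768, 8
lora_count = lora_trainable_params(width, width, rank)  # 12288
full_count = full_weight_params(width, width)           # 589824
fraction = lora_count / full_count                      # 1/48, about 2.1%
```

Under these assumed values, the update for this one layer trains about 2% of the parameters that fully fine-tuning the same matrix would.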
