CLIP-SLA: Parameter-Efficient CLIP Adaptation for Continuous Sign Language Recognition

Sarah Alyami, Hamzah Luqman

TL;DR

This work addresses the data and compute demands of continuous sign language recognition (CSLR) by repurposing large vision-language models (VLMs). It introduces CLIP-SLA, a parameter-efficient framework that keeps CLIP's visual backbone frozen and comes in two PEFT-based variants: SLA-LoRA, which combines LoRA with a temporal shift module (TSM), and SLA-Adapter, which inserts adapters containing a 3DConv layer. Both variants add temporal modeling to the visual encoder and feed a CSLR sequence module trained with CTC and VAC losses. The methods achieve competitive or state-of-the-art performance on four CSLR benchmarks (Phoenix2014, Phoenix2014-T, CSL-Daily, Isharah-500) with far fewer trainable parameters, and extensive ablations validate the temporal integration and adapter design. The results show that large pre-trained VLMs can be adapted for efficient CSLR and point to future work on other PEFT methods and VLMs for sign language understanding.

Abstract

Continuous sign language recognition (CSLR) focuses on interpreting and transcribing sequences of sign language gestures in videos. In this work, we propose CLIP sign language adaptation (CLIP-SLA), a novel CSLR framework that adapts the powerful pre-trained visual encoder of the CLIP model to sign language tasks through parameter-efficient fine-tuning (PEFT). We introduce two variants, SLA-Adapter and SLA-LoRA, which integrate PEFT modules into the CLIP visual encoder, enabling fine-tuning with minimal trainable parameters. The effectiveness of the proposed frameworks is validated on four datasets: Phoenix2014, Phoenix2014-T, CSL-Daily, and Isharah-500, where both CLIP-SLA variants outperformed several SOTA models with fewer trainable parameters. Extensive ablation studies highlight the effectiveness and flexibility of the proposed methods with different vision-language models for CSLR. These findings showcase the potential of adapting large-scale pre-trained models for scalable and efficient CSLR, paving the way for future advancements in sign language understanding.
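
To make "minimal trainable parameters" concrete, the arithmetic below compares full fine-tuning of a single weight matrix with a LoRA update of the same matrix. LoRA keeps the original d_out × d_in weight frozen and trains two low-rank factors, so the trainable count drops from d_out · d_in to r · (d_in + d_out). The layer width of 768 and the rank of 8 are illustrative assumptions, not values reported in the paper, and the function names are hypothetical.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a LoRA update W + B @ A.

    A has shape (rank, d_in) and B has shape (d_out, rank); W stays frozen.
    """
    return rank * d_in + d_out * rank


# Hypothetical sizes for illustration only (not taken from the paper).
D_IN, D_OUT, RANK = 768, 768, 8

full = full_finetune_params(D_IN, D_OUT)
lora = lora_params(D_IN, D_OUT, RANK)
print(f"full fine-tuning: {full} weights")
print(f"LoRA (rank {RANK}): {lora} weights ({100 * lora / full:.2f}% of full)")
```

Under these assumed sizes the low-rank update trains about 2% of the weights that full fine-tuning of the same matrix would.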

Paper Structure

This paper contains 9 sections, 4 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: The architecture of the SLA-LoRA module. It shows the integration of the TSM and LoRA modules within the MHSA and MLP blocks of the ViT-based CLIP visual encoder.
  • Figure 2: Overview of the proposed SLA-Adapter framework where the adapters are placed before the MHSA and MLP blocks. A detailed view of the time-aware adapter shows that the 3DConv layer is inserted between the downward and upward projections for effective spatio-temporal adaptation.
  • Figure 3: Samples from the Isharah-500 dataset captured using smartphone cameras in unrestricted settings.
  • Figure 4: Grad-CAM visualizations from SLA-LoRA (2nd row) and SLA-Adapter (bottom row), showing attention focused on regions that are informative in sign language, such as the hands and face.
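
Figure 1 refers to a temporal shift module (TSM). The general idea of a temporal shift can be sketched as follows: a fraction of each frame's channels is replaced by the same channels from a neighboring frame, which mixes information across time without adding trainable weights. This is a minimal pure-Python sketch of the generic operation only. The function name, the `fold_div` fraction, and the zero padding at the sequence boundaries are assumptions, and the sketch does not show where the paper applies the shift inside the MHSA and MLP blocks.

```python
from typing import List

Frames = List[List[float]]  # frames[t][c]: feature c of frame t


def temporal_shift(frames: Frames, fold_div: int = 4) -> Frames:
    """Shift one channel group forward in time and one backward.

    The first C // fold_div channels take their value from the previous frame,
    the next C // fold_div channels from the next frame, and the remaining
    channels are left unchanged. Positions with no neighbor are zero-filled.
    """
    num_frames = len(frames)
    num_channels = len(frames[0]) if frames else 0
    fold = num_channels // fold_div
    shifted = [row[:] for row in frames]  # copy; untouched channels pass through
    for t in range(num_frames):
        for c in range(fold):  # group 1: value comes from frame t - 1
            shifted[t][c] = frames[t - 1][c] if t > 0 else 0.0
        for c in range(fold, 2 * fold):  # group 2: value comes from frame t + 1
            shifted[t][c] = frames[t + 1][c] if t < num_frames - 1 else 0.0
    return shifted


# Three frames with four channels each; frame t holds the values 10*t + c.
clip = [[10.0 * t + c for c in range(4)] for t in range(3)]
out = temporal_shift(clip, fold_div=4)
for row in out:
    print(row)
```

In this toy input, channels 2 and 3 pass through unchanged, channel 0 of each frame carries the previous frame's value, and channel 1 carries the next frame's value.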