Exploiting Ensemble Learning for Cross-View Isolated Sign Language Recognition

Fei Wang, Kun Li, Yiqi Nie, Zhangling Duan, Peng Zou, Zhiliang Wu, Yuwei Wang, Yanyan Wei

TL;DR

This work tackles Cross-View Isolated Sign Language Recognition (CV-ISLR), where the camera viewpoint at test time differs from the one seen in training, with a two-stage ensemble framework built on Video Swin Transformer (VST) models for RGB and depth streams. The first stage ensembles the Large, Base, and Small VST variants within each modality; the second stage fuses the RGB and depth predictions. On the MM-WLAuslan dataset, the method ranked third on both the RGB and RGB-D tracks, which supports the value of ensemble strategies for cross-view recognition. The contribution is a practical framework that combines multi-granularity spatiotemporal features with cross-modal fusion to make ISLR more robust to viewpoint changes in real-world settings.

Abstract

In this paper, we present our solution to the Cross-View Isolated Sign Language Recognition (CV-ISLR) challenge held at WWW 2025. CV-ISLR addresses a critical issue in traditional Isolated Sign Language Recognition (ISLR): existing datasets predominantly capture sign language videos from a frontal perspective, while real-world camera angles often vary. To recognize signs accurately from different viewpoints, models must understand gestures from multiple angles, which makes cross-view recognition challenging. To address this, we explore the advantages of ensemble learning, which enhances model robustness and generalization across diverse views. Our approach, built on a multi-dimensional Video Swin Transformer model, leverages this ensemble strategy to achieve competitive performance. Our solution ranked 3rd in both the RGB-based ISLR and RGB-D-based ISLR tracks, demonstrating its effectiveness in handling the challenges of cross-view recognition. The code is available at: https://github.com/Jiafei127/CV_ISLR_WWW2025.

Paper Structure

This paper contains 15 sections, 3 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The MM-WLAuslan dataset for Cross-View Isolated Sign Language Recognition (CV-ISLR) includes RGB and depth videos in train (front view), validation (left view), and test (left and right views) datasets.
  • Figure 2: Illustration of the ensemble learning process. Multiple classifiers are trained on the same dataset, and their predictions are aggregated to form an ensemble model with improved robustness and accuracy.
  • Figure 3: Overview of the proposed architecture for CV-ISLR. The architecture processes RGB and depth videos through Video Swin Transformer blocks, with multi-dimensional models for both branches. An ensemble learning approach is applied in two stages: single-modal classification and multi-modal fusion to improve performance across different viewpoints.
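
The two-stage ensemble summarized in the Figure 3 caption (single-modal classification, then multi-modal fusion) can be illustrated with a minimal sketch. This is not the authors' implementation: the aggregation rule (averaging softmax probabilities), the equal per-model weights, and the `rgb_weight` fusion parameter are all assumptions made for illustration, and the function names are hypothetical. The inputs stand in for per-class logits produced by the Small, Base, and Large VST variants of each branch.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_probs(prob_lists, weights=None):
    """Weighted average of equal-length probability vectors."""
    if weights is None:
        weights = [1.0] * len(prob_lists)
    total_weight = sum(weights)
    num_classes = len(prob_lists[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total_weight
        for c in range(num_classes)
    ]


def two_stage_ensemble(rgb_logits, depth_logits, rgb_weight=0.5):
    """Hypothetical two-stage ensemble.

    rgb_logits / depth_logits: one logit vector per model variant
    (e.g. Small, Base, Large) for a single video.
    rgb_weight: assumed fusion weight for the RGB branch, in [0, 1].
    """
    # Stage 1: ensemble within each modality (assumed: plain averaging).
    rgb_probs = average_probs([softmax(v) for v in rgb_logits])
    depth_probs = average_probs([softmax(v) for v in depth_logits])

    # Stage 2: fuse the two modalities (assumed: weighted averaging).
    fused = average_probs(
        [rgb_probs, depth_probs], [rgb_weight, 1.0 - rgb_weight]
    )
    predicted_class = max(range(len(fused)), key=fused.__getitem__)
    return predicted_class, fused
```

For the RGB-only track, the same sketch reduces to stage 1 alone, which corresponds to calling `two_stage_ensemble` with `rgb_weight=1.0`. In practice the weights would be tuned on the validation split rather than fixed by hand.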