Pose-Based Sign Language Spotting via an End-to-End Encoder Architecture

Samuel Ebimobowei Johnny, Blessed Guda, Emmanuel Enejo Aaron, Assane Gueye

TL;DR

The paper tackles sign language spotting by defining a video-to-video search task: given a query sign video and a sentence-level sign video, determine whether the query occurs in the sentence. It proposes an end-to-end pose-based framework that uses a Pose CNN Encoder and a Visual Transformer to learn representations from pose keypoints, and it trains the presence predictor with binary cross-entropy (BCE). On the Word Presence Prediction dataset, the method achieves 61.88% accuracy and a 60.00% F1-score, which the authors present as evidence that pose-based representations can detect sign presence efficiently while suppressing visual noise. Ablation studies indicate that 2D pose encoding and BCE loss are crucial for performance, establishing a baseline for future automatic sign language retrieval and verification tasks.

Abstract

Automatic Sign Language Recognition (ASLR) has emerged as a vital field for bridging the gap between deaf and hearing communities. However, the problem of sign-to-sign retrieval or detecting a specific sign within a sequence of continuous signs remains largely unexplored. We define this novel task as Sign Language Spotting. In this paper, we present a first step toward sign language retrieval by addressing the challenge of detecting the presence or absence of a query sign video within a sentence-level gloss or sign video. Unlike conventional approaches that rely on intermediate gloss recognition or text-based matching, we propose an end-to-end model that directly operates on pose keypoints extracted from sign videos. Our architecture employs an encoder-only backbone with a binary classification head to determine whether the query sign appears within the target sequence. By focusing on pose representations instead of raw RGB frames, our method significantly reduces computational cost and mitigates visual noise. We evaluate our approach on the Word Presence Prediction dataset from the WSLP 2025 shared task, achieving 61.88% accuracy and 60.00% F1-score. These results demonstrate the effectiveness of our pose-based framework for Sign Language Spotting, establishing a strong foundation for future research in automatic sign language retrieval and verification. Code is available at https://github.com/EbimoJohnny/Pose-Based-Sign-Language-Spotting
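
The accuracy and F1-score reported above are standard binary classification metrics. As a minimal sketch of how they could be computed for presence/absence predictions, the snippet below assumes that label 1 means "query present" and that F1 is taken for the positive class; the abstract does not state which F1 averaging is used, and the function name `binary_metrics` and the toy labels are hypothetical, not taken from the paper or its repository.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, f1) for 0/1 labels, with 1 ("query present") as the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # present, predicted present
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # absent, predicted absent
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # absent, predicted present
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # present, predicted absent

    accuracy = (tp + tn) / len(pairs)
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 here when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy labels invented for illustration only (not data from the paper).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
acc, f1 = binary_metrics(y_true, y_pred)
print(f"accuracy={acc:.4f} f1={f1:.4f}")  # accuracy=0.7500 f1=0.7500
```

Reporting both numbers matters for this task because accuracy alone can look acceptable on an imbalanced split, whereas F1 penalizes a model that rarely predicts "present".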

Paper Structure

This paper contains 15 sections, 3 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Architecture overview. Pose sequences are encoded using 2D CNNs and then processed by a Transformer encoder, which produces visual tokens. The [CLS] token is max-pooled to predict query presence using binary cross-entropy loss ($\mathcal{L}_{BCE}$).
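
The final stage named in the Figure 1 caption (pooling, presence prediction, BCE loss) can be illustrated with a small dependency-free sketch. This is one possible reading of the caption, not the authors' implementation: it assumes the encoder output is a list of token vectors, that max-pooling is applied element-wise over the token axis, and that a single linear layer followed by a sigmoid acts as the binary head. The names `max_pool_tokens`, `presence_probability`, and `bce_loss`, as well as the weights and token values, are hypothetical.

```python
import math


def max_pool_tokens(tokens):
    """Element-wise max over the token axis: (T, D) list of lists -> (D,) list."""
    if not tokens:
        raise ValueError("tokens must contain at least one vector")
    return [max(column) for column in zip(*tokens)]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def presence_probability(tokens, weights, bias):
    """Pool the tokens, apply an assumed linear head, and squash to a probability."""
    pooled = max_pool_tokens(tokens)
    logit = sum(w * h for w, h in zip(weights, pooled)) + bias
    return sigmoid(logit)


def bce_loss(prob, label, eps=1e-7):
    """Binary cross-entropy for one example; prob is clamped to avoid log(0)."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Toy encoder output: 2 tokens of dimension 3 (values invented for illustration).
tokens = [[0.2, -1.0, 0.5],
          [0.9, 0.1, -0.3]]
weights = [1.0, -0.5, 0.25]  # hypothetical head parameters
bias = 0.0

prob = presence_probability(tokens, weights, bias)
print("pooled:", max_pool_tokens(tokens))
print("loss if query present:", bce_loss(prob, 1))
print("loss if query absent:", bce_loss(prob, 0))
```

In a real model the tokens would come from the CNN and Transformer encoder, and the head parameters would be learned by minimizing this loss over the training pairs.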