Signformer is all you need: Towards Edge AI for Sign Language

Eta Yang

TL;DR

This paper presents an analysis of the nature of sign languages to inform its algorithmic design and delivers a scalable transformer pipeline with novel convolution and attention mechanisms. It achieves a new 2nd place on the leaderboard with a 467-1807x reduction in parameters relative to the strongest methods as of 2024.

Abstract

Sign language translation, especially in the gloss-free paradigm, faces a dilemma of impracticality and unsustainability due to increasingly resource-intensive methodologies. Contemporary state-of-the-art (SOTA) methods hinge heavily on sophisticated pretrained backbones such as Large Language Models (LLMs), embedding sources, or extensive datasets, which introduces parametric and computational inefficiency that hinders sustainable use in real-world scenarios. Despite their success, following this research direction undermines the overarching mission of the domain: to create substantial value by bridging hard-of-hearing and hearing populations. Rather than committing to the prevailing trend of LLM and Natural Language Processing (NLP) studies, we pursue a fundamental change in architecture to achieve ground-up improvements without external aid from pretrained models, prior knowledge transfer, or any NLP strategy that is not from-scratch. We introduce Signformer, a from-scratch "Feather-Giant" that moves the area towards Edge AI and redefines the extremes of performance and efficiency, combining LLM-level competence with edge-deployable compactness. In this paper, we present an analysis of the nature of sign languages to inform our algorithmic design and deliver a scalable transformer pipeline with novel convolution and attention mechanisms. We achieve a new 2nd place on the leaderboard with a 467-1807x reduction in parameters relative to the strongest methods as of 2024, and outperform almost every other method with a lighter configuration of 0.57 million parameters.

Paper Structure

This paper contains 18 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: (Top): Gloss-Free TOP5 Leaderboard 2024; (Bottom): Sign Language Translation (Gloss-Based & Gloss-Free) TOP5 Leaderboard 2024. Our Signformers, using 3M parameters, achieve 2nd place against the strongest methods, which range from 1B to 7B parameters, and perform comparably to most gloss-based SLT approaches while considerably improving efficiency, Information Density, and NetScore.
  • Figure 2: Architecture of Signformer, composed of a convolutional module, CoPE-Gloss Attention, and CoPE-Cross Attention. The model is built and trained from scratch without external embeddings or pretrained sources, using raw spatial and word embedding layers.
  • Figure 3: Convolution Module, composed of a stacked Pointwise-Depthwise-Pointwise 1D convolution encapsulated within a symmetric LayerNormalization flow, designed to work together with gloss attention to extract higher-level visual features.
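
The Figure 3 caption can be illustrated with a minimal pure-Python sketch of a Pointwise-Depthwise-Pointwise 1D convolution wrapped in LayerNorm. This is not the authors' implementation. It assumes that "symmetric" means one LayerNorm before and one after the convolution stack, and it omits activations, residual connections, biases, and learned LayerNorm parameters because the caption does not specify them. All function names, shapes, and sizes are hypothetical.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize every time step (row of x) across its channels."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def pointwise_conv(x, weight):
    """Kernel-size-1 convolution: mixes channels at each time step.

    x has shape T x C_in, weight has shape C_out x C_in.
    """
    return [[sum(w * v for w, v in zip(w_row, row)) for w_row in weight]
            for row in x]


def depthwise_conv(x, kernels):
    """Depthwise 1D convolution with zero 'same' padding.

    Each channel is filtered along time by its own kernel.
    x has shape T x C, kernels has shape C x K with K odd.
    """
    steps, channels = len(x), len(x[0])
    size = len(kernels[0])
    pad = size // 2
    out = [[0.0] * channels for _ in range(steps)]
    for t in range(steps):
        for c in range(channels):
            acc = 0.0
            for k in range(size):
                s = t + k - pad
                if 0 <= s < steps:  # positions outside the sequence count as zero
                    acc += kernels[c][k] * x[s][c]
            out[t][c] = acc
    return out


def conv_module(x, w_in, dw_kernels, w_out):
    """Assumed layout: LayerNorm -> pointwise -> depthwise -> pointwise -> LayerNorm."""
    h = layer_norm(x)
    h = pointwise_conv(h, w_in)
    h = depthwise_conv(h, dw_kernels)
    h = pointwise_conv(h, w_out)
    return layer_norm(h)


def count_params(channels, hidden, kernel):
    """Weights in the three convolutions (no biases in this sketch)."""
    return channels * hidden + hidden * kernel + hidden * channels


# Hypothetical sizes: 5 frames, 4 input channels, 8 hidden channels, kernel 3.
rng = random.Random(0)
T, C, H, K = 5, 4, 8, 3
frames = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(T)]
w_in = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(H)]
dw = [[rng.uniform(-1, 1) for _ in range(K)] for _ in range(H)]
w_out = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(C)]

features = conv_module(frames, w_in, dw, w_out)
print(len(features), len(features[0]))  # output keeps the T x C shape
print(count_params(C, H, K))
```

The depthwise step is where the parameter savings come from: it needs only `hidden * kernel` weights because each channel is filtered independently, while the two pointwise layers handle all cross-channel mixing.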