SignSparK: Efficient Multilingual Sign Language Production via Sparse Keyframe Learning

Jianhe Low, Alexandre Symeonidis-Herzig, Maksym Ivashechkin, Ozge Mercanoglu Sincan, Richard Bowden

TL;DR

This work introduces FAST, an ultra-efficient sign segmentation model that automatically mines precise temporal boundaries, and presents SignSparK, a large-scale Conditional Flow Matching (CFM) framework that utilizes these extracted anchors to synthesize 3D signing sequences in SMPL-X and MANO spaces, establishing the largest multilingual SLP framework to date.

Abstract

Generating natural and linguistically accurate sign language avatars remains a formidable challenge. Current Sign Language Production (SLP) frameworks face a stark trade-off: direct text-to-pose models suffer from regression-to-the-mean effects, while dictionary-retrieval methods produce robotic, disjointed transitions. To resolve this, we propose a novel training paradigm that leverages sparse keyframes to capture the true underlying kinematic distribution of human signing. By predicting dense motion from these discrete anchors, our approach mitigates regression-to-the-mean while ensuring fluid articulation. To realize this paradigm at scale, we first introduce FAST, an ultra-efficient sign segmentation model that automatically mines precise temporal boundaries. We then present SignSparK, a large-scale Conditional Flow Matching (CFM) framework that utilizes these extracted anchors to synthesize 3D signing sequences in SMPL-X and MANO spaces. This keyframe-driven formulation also unlocks Keyframe-to-Pose (KF2P) generation, making precise spatiotemporal editing of signing sequences possible. Furthermore, our reconstruction-based CFM objective enables high-fidelity synthesis in fewer than ten sampling steps; this efficiency allows SignSparK to scale across four distinct sign languages, establishing the largest multilingual SLP framework to date. Finally, by integrating 3D Gaussian Splatting for photorealistic rendering, we demonstrate through extensive evaluation that SignSparK establishes a new state of the art across diverse SLP tasks and multilingual benchmarks.
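
To make the few-step sampling claim concrete, the following is a minimal sketch of Euler sampling for a flow-matching model that predicts the clean sample (a reconstruction-style objective). It assumes the common linear path x_t = (1 - t) x_0 + t x_1, under which a clean-pose estimate converts to a velocity (x1_hat - x_t) / (1 - t); the abstract does not state the exact formulation, so this path, the function name `sample_with_clean_prediction`, and the `predict_clean` callable are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def sample_with_clean_prediction(predict_clean, shape, num_steps=8, seed=0):
    """Euler sampler for a flow model that predicts the clean sample x_1.

    predict_clean(x_t, t) is a hypothetical stand-in for a trained network
    (conditioned on text and keyframes in the paper; omitted here).
    Assumes the linear path x_t = (1 - t) * x_0 + t * x_1.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)  # x_0 drawn from a standard Gaussian
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays in [0, 1 - dt], so 1 - t never reaches zero
        x1_hat = predict_clean(x, t)
        velocity = (x1_hat - x) / (1.0 - t)  # velocity implied by the estimate
        x = x + dt * velocity
    return x


# Toy check with an "oracle" predictor that always returns a fixed target.
target = np.ones((4, 3))
sample = sample_with_clean_prediction(lambda x, t: target, target.shape)
```

With a perfect predictor, the final Euler step lands on the target regardless of the step count, which gives some intuition for why clean-sample prediction can tolerate a small number of sampling steps; a real network would only approximate this.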

Paper Structure (11 sections, 7 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: SignSparK is a Conditional Flow Matching model trained on sparse keyframes, and generates realistic and natural 3D signing avatars given spoken text. Designed for efficiency, SignSparK scales to four distinct sign languages under a unified framework.
  • Figure 2: Overview of FAST. (a) Architecture: WiLoR first extracts the left and right hand representations from input frames. These are then encoded via parallel spatio-temporal streams, concatenated, and refined by a two-stream mixer before a Transformer generates dense per-frame BIO segmentation labels. (b) Selection Policy: Leveraging the predicted BIO segments, we explicitly isolate the onset, midpoint, and offset frames of each sign to construct a semantically rich keyframe mask.
  • Figure 3: Architecture of SignSparK. (i) A sign language video is first processed by WiLoR and NLF to extract MANO and SMPL-X parameters, while its text translation is embedded via Multilingual-CLIP (Carlsson et al., 2022). (ii) FAST subsequently localizes sign segments, and the selection policy pinpoints the keyframes needed to form the control signal. (iii) A UNet, conditioned on timestep, control signal, and text, then reconstructs clean poses. (iv) These poses can then be rendered from meshes into realistic signing avatars via 3DGS. We provide further 3DGS implementation details in the supplementary material.
  • Figure 4: Ablation Studies. (a) We ablate FAST's design, comparing single- versus two-stream architectures and refinement choices. (b) We analyze SignSparK's efficiency and performance against standard diffusion and CFM models at various sampling steps. (c) We evaluate our keyframe selection policy as well as SignSparK's loss configurations.
  • Figure 5: Dataset Ablation and User Study. (a) Investigates how the quantity of multilingual training data influences performance and whether prepending language token identifiers (e.g., <ASL>) to text inputs enhances multilingual generation. (b) Illustrates user study results comparing SignSparK against SOTA and baseline models.
  • ...and 1 more figures
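
The selection policy described in the Figure 2 caption (onset, midpoint, and offset of each predicted sign segment) can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the string tags "B", "I", "O", the integer-floor midpoint, the tolerance for an "I" with no preceding "B", and the function names `bio_to_segments` and `keyframe_mask` are all assumptions.

```python
import numpy as np


def bio_to_segments(labels):
    """Return inclusive (start, end) frame indices for each sign segment.

    labels is a per-frame sequence of "B" (begin), "I" (inside), "O" (outside).
    """
    segments = []
    start = None
    for i, tag in enumerate(labels):
        if tag == "B":
            if start is not None:  # a new sign begins right after another
                segments.append((start, i - 1))
            start = i
        elif tag == "I":
            if start is None:  # assumption: treat a stray "I" as a segment start
                start = i
        else:  # "O" closes any open segment
            if start is not None:
                segments.append((start, i - 1))
                start = None
    if start is not None:  # segment running to the last frame
        segments.append((start, len(labels) - 1))
    return segments


def keyframe_mask(labels):
    """Boolean mask marking onset, midpoint, and offset of every segment."""
    mask = np.zeros(len(labels), dtype=bool)
    for start, end in bio_to_segments(labels):
        mask[[start, (start + end) // 2, end]] = True
    return mask


labels = list("OBIIIOOBIB")
mask = keyframe_mask(labels)
```

For very short segments the three keyframes coincide (a one-frame sign contributes a single keyframe), so the mask is sparse by construction and its density depends on how many signs the segmenter finds.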