IlluSign: Illustrating Sign Language Videos by Leveraging the Attention Mechanism

Janna Bruner, Amit Moryossef, Lior Wolf

TL;DR

IlluSign presents a zero-shot, diffusion-based method to convert sign language video frames into sketch-like illustrations that preserve hand gestures and motion. The pipeline is modular, employing edge-driven geometry, style-injected high-resolution attention, and a two-frame fusion (start and end) with trajectory arrows to convey motion. It integrates segmentation, hand and arm-path masking, and trajectory visualization to produce cost-effective educational illustrations that complement video material. The approach demonstrates strong qualitative results, competitive quantitative metrics, and ablations that justify design choices, offering a scalable resource for sign-language education and dictionaries.

Abstract

Sign languages are dynamic visual languages that involve hand gestures in combination with non-manual elements such as facial expressions. While video recordings of sign language are commonly used for education and documentation, the dynamic nature of signs can make it challenging to study them in detail, especially for new learners and educators. This work aims to convert sign language video footage into static illustrations, which serve as an additional educational resource to complement video content. This process is usually done by an artist and is therefore quite costly. We propose a method that illustrates sign language videos by leveraging generative models' ability to understand both the semantic and geometric aspects of images. Our approach focuses on transferring a sketch-like illustration style to video footage of sign language, combining the start and end frames of a sign into a single illustration, and using arrows to highlight the hand's direction and motion. While many style transfer methods address domain adaptation at varying levels of abstraction, applying a sketch-like style to sign languages, especially for hand gestures and facial expressions, poses a significant challenge. To tackle this, we intervene in the denoising process of a diffusion model, injecting style as keys and values into high-resolution attention layers, and fusing geometric information from the image and edges as queries. For the final illustration, we use the attention mechanism to combine the attention weights from both the start and end illustrations, resulting in a soft combination. Our method offers a cost-effective solution for generating sign language illustrations at inference time, addressing the lack of such resources in educational materials.
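The attention intervention described in the abstract (style as keys and values, geometry from the image and edges as queries) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `style_injected_attention`, the blend weight `alpha`, and the random toy tensors are assumptions made for illustration, and a real pipeline would apply this inside the attention layers of a diffusion model rather than on standalone arrays.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def style_injected_attention(q_img, q_edges, k_style, v_style, alpha=0.5):
    """Scaled dot-product attention with blended queries and style keys/values.

    q_img, q_edges: (n_tokens, d) queries from the input frame and its edge map.
    k_style, v_style: (m_tokens, d) keys and values from the style image.
    alpha: hypothetical blend weight (assumption, not a value from the paper).
    """
    # Queries carry geometry: a linear combination of image and edge queries.
    q = alpha * q_img + (1.0 - alpha) * q_edges
    d = q.shape[-1]
    # Keys and values carry appearance: both come from the style image.
    weights = softmax(q @ k_style.T / np.sqrt(d), axis=-1)
    return weights @ v_style, weights


# Toy tensors standing in for real attention features.
rng = np.random.default_rng(0)
n, m, d = 16, 16, 8
q_img = rng.normal(size=(n, d))
q_edges = rng.normal(size=(n, d))
k_style = rng.normal(size=(m, d))
v_style = rng.normal(size=(m, d))

out, weights = style_injected_attention(q_img, q_edges, k_style, v_style, alpha=0.6)
print(out.shape)  # (16, 8)
```

Because every output token is a weighted average of style values, the result can only contain style appearance, while the queries decide where each style feature lands. That split is what lets the geometry of the frame survive the style change.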

Paper Structure

This paper contains 14 sections, 9 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: In this work, we present a method for transforming sign language video frames into illustrations that capture the geometric details of the images, emphasizing hand gestures, direction and motion.
  • Figure 2: Method Overview: In the first stage, we invert the images to their latent noise representation and initiate the diffusion process from the noise of the edge image. In the final resolution attention layers of the decoder, we inject the Keys and Values from the style image, and the Queries as a linear combination of the queries derived from the image and the queries from the edge map. In the second stage, we apply another diffusion process to fuse query features from the start image and end image, initializing with the latent noise of the start sign image. In the last resolution attention layers, we inject into the queries a combination of dissimilar features and hand masks. The dissimilar features between the queries contribute to the soft overlay of the images, while the hand masks enhance the appearance of the hands.
  • Figure 3: Final Illustrations, including the intermediate steps of our method. The process begins with two input frames, which are first transformed into the desired illustration style. Next, an overlay step is applied, followed by the addition of directional arrows. The rightmost column displays the ground truth (GT) illustration from the online dictionary signsuisse.
  • Figure 4: Qualitative comparison of random samples from our data. Each input image is taken from a video frame as the driving image, and the style image corresponds to the illustration style. We compare our results to five baselines. In our method, we use two inputs: the input image (I_img) and an edge map (I_edges).
  • Figure 5: Generalization to Other Styles: We demonstrate that our method is comparable to state-of-the-art style transfer techniques, effectively generalizing across different domains and styles. Notably, it excels in preserving structural and fine details, particularly on subjects such as human faces.
  • ...and 3 more figures
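
The second stage in the Figure 2 caption, which fuses start-frame and end-frame queries using dissimilar features and hand masks, can be sketched as follows. The caption does not give the formula, so everything here is an assumed reading: the name `fuse_queries`, the cosine-similarity test, the `threshold`, and the two weights are hypothetical and serve only to make the idea concrete.

```python
import numpy as np


def fuse_queries(q_start, q_end, hand_mask, threshold=0.5,
                 overlay_weight=0.5, hand_weight=1.0):
    """Blend end-frame queries into start-frame queries, token by token.

    q_start, q_end: (n_tokens, d) queries from the start and end illustrations.
    hand_mask: (n_tokens,) array with 1.0 on end-frame hand tokens, else 0.0.
    threshold, overlay_weight, hand_weight: hypothetical parameters (assumptions).
    """
    # Per-token cosine similarity between the two sets of queries.
    num = (q_start * q_end).sum(axis=-1)
    den = np.linalg.norm(q_start, axis=-1) * np.linalg.norm(q_end, axis=-1) + 1e-8
    cos = num / den
    # Tokens that differ between the frames get a partial (soft) overlay.
    dissimilar = (cos < threshold).astype(float)
    # Hand tokens get extra weight so the end-frame hands stay clearly visible.
    w = np.clip(overlay_weight * dissimilar + hand_weight * hand_mask, 0.0, 1.0)
    return (1.0 - w)[:, None] * q_start + w[:, None] * q_end


# Toy example: tokens 0-2 are identical in both frames, tokens 3-5 differ.
rng = np.random.default_rng(1)
q_start = rng.normal(size=(6, 4))
q_end = q_start.copy()
q_end[3:] = -q_start[3:]
hand_mask = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 1.0])

fused = fuse_queries(q_start, q_end, hand_mask)
print(fused.shape)  # (6, 4)
```

In this toy setup, unchanged tokens keep the start-frame queries, differing tokens become an even mix of both frames, and the masked hand token takes the end-frame queries outright. That mirrors the caption's distinction between a soft overlay and enhanced hands.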