Pose-Guided Fine-Grained Sign Language Video Generation

Tongkai Shi, Lianyu Hu, Fanhua Shang, Jichao Feng, Peidong Liu, Wei Feng

TL;DR

A novel Pose-Guided Motion Model for generating fine-grained and motion-consistent sign language videos that outperforms state-of-the-art methods in most benchmark tests, with visible improvements in details and temporal consistency.

Abstract

Sign language videos are an important medium for spreading and learning sign language. However, most existing human image synthesis methods produce sign language images with details that are distorted, blurred, or structurally incorrect. They also produce sign language video frames with poor temporal consistency, showing anomalies such as flickering and abrupt changes in detail between adjacent frames. To address these limitations, we propose a novel Pose-Guided Motion Model (PGMM) for generating fine-grained and motion-consistent sign language videos. First, we propose a new Coarse Motion Module (CMM), which deforms features by optical flow warping, thus transferring the motion of coarse-grained structures without changing the appearance. Second, we propose a new Pose Fusion Module (PFM), which guides the modal fusion of RGB and pose features, thus completing the fine-grained generation. Finally, we design a new metric, Temporal Consistency Difference (TCD), to quantitatively assess the temporal consistency of a video by comparing the difference between the frames of the reconstructed video and the previous and next frames of the target video. Extensive qualitative and quantitative experiments show that our method outperforms state-of-the-art methods in most benchmark tests, with visible improvements in details and temporal consistency.
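
The abstract says the CMM deforms features by optical flow warping but does not specify the operator or how the flow is estimated. As a rough illustration of the general technique only (not the paper's implementation), the sketch below backward-warps a feature map with a dense flow field using bilinear sampling in NumPy. The function name `warp_features`, the (dx, dy) flow layout, and the border clamping are all assumptions made for this example.

```python
import numpy as np


def warp_features(feat: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp a feature map with a dense flow field (illustrative).

    feat: (C, H, W) source features.
    flow: (2, H, W) per-pixel offsets (dx, dy). The output at (y, x) is
          sampled from feat at (x + dx, y + dy) with bilinear interpolation.
    """
    _, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # Sampling locations, clamped to the border (an assumed boundary rule).
    sx = np.clip(xs + flow[0], 0, w - 1)
    sy = np.clip(ys + flow[1], 0, h - 1)

    # Integer corners surrounding each sampling location.
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional distances from the top-left corner.
    wx = sx - x0
    wy = sy - y0

    # Interpolate along x on the two rows, then along y.
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy
```

With a zero flow the function returns the input unchanged, and a constant flow of dx = 1 shifts the content one pixel to the left. Because only sampling positions change, the feature values (the appearance) are moved rather than re-synthesized, which matches the abstract's description of transferring coarse motion without changing appearance.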

Paper Structure (28 sections, 8 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Problems of existing methods: (1) Poor temporal consistency. (2) Image detail blurring and distortion.
  • Figure 2: Comparison of previous two types of methods and the proposed method.
  • Figure 3: The framework of the proposed method. The Feature Extractor first extracts decoupled representations: RGB features, motion features, and pose features. The Generator then uses these features to generate the image; it consists of multiple sets of a Coarse Motion Module for structural motion, a Pose Fusion Module for generating details through pose-guided fusion, a Convolution Module for image inpainting, and an Up Convolution Module for up-sampling features.
  • Figure 4: The structure of our Coarse Motion Module (CMM) and our Pose Fusion Module (PFM).
  • Figure 5: Qualitative comparison with DynaST [liu2022dynast], MRAA [siarohin2021motion], and TPSMM [zhao2022thin] on video reconstruction: Phoenix-2014T (left) and CSL-Daily (right).
  • ...and 2 more figures