
Motion-Based Sign Language Video Summarization using Curvature and Torsion

Evangelos G. Sartinas, Emmanouil Z. Psarakis, Dimitrios I. Kosmopoulos

TL;DR

A new informative function based on the $t$-parameterized curvature and torsion of the 3-D trajectory is proposed. It is evaluated experimentally on sign language videos using objective measures against ground-truth keyframe annotations, human-based evaluation of understanding, and gloss classification, with promising results.

Abstract

An interesting problem in many video-based applications is the generation of short synopses by selecting the most informative frames, a procedure known as video summarization. For sign language videos, the benefits of using the $t$-parameterized counterpart of the curvature of the signer's 2-D wrist trajectory to identify keyframes have recently been reported in the literature. In this paper we extend these ideas by modeling the 3-D hand motion extracted from each frame of the video. To this end, we propose a new informative function based on the $t$-parameterized curvature and torsion of the 3-D trajectory. The method used to characterize video frames as keyframes depends on whether the motion occurs in 2-D or 3-D space. Specifically, for 3-D motion we look for the maxima of the harmonic mean of the curvature and torsion of the target's trajectory; for planar motion we seek the maxima of the trajectory's curvature. The proposed 3-D feature is experimentally evaluated on sign language videos using (1) objective measures based on ground-truth keyframe annotations, (2) human-based evaluation of understanding, and (3) gloss classification, and the results obtained are promising.
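
The keyframe rule described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the textbook formulas for a curve $r(t)$ with an arbitrary parameter $t$, namely $\kappa = \|\dot{r} \times \ddot{r}\| / \|\dot{r}\|^3$ and $\tau = (\dot{r} \times \ddot{r}) \cdot \dddot{r} / \|\dot{r} \times \ddot{r}\|^2$, estimates derivatives by finite differences, and takes the absolute value of the torsion before forming the harmonic mean. The function names, the `planar` flag, and the `eps` guard are hypothetical choices made for this illustration; the paper's exact definitions are not given in this excerpt.

```python
import numpy as np


def curvature_torsion(traj, dt=1.0, eps=1e-12):
    """Estimate curvature and torsion of a sampled 3-D trajectory.

    traj: (N, 3) array of positions sampled every `dt` (assumed uniform).
    Derivatives use np.gradient (central differences in the interior,
    one-sided at the two ends), so values near the ends are less reliable.
    """
    d1 = np.gradient(traj, dt, axis=0)  # first derivative
    d2 = np.gradient(d1, dt, axis=0)    # second derivative
    d3 = np.gradient(d2, dt, axis=0)    # third derivative
    cross = np.cross(d1, d2)
    cross_norm = np.linalg.norm(cross, axis=1)
    speed = np.linalg.norm(d1, axis=1)
    kappa = cross_norm / np.maximum(speed**3, eps)
    tau = np.einsum("ij,ij->i", cross, d3) / np.maximum(cross_norm**2, eps)
    return kappa, tau


def informative_function(kappa, tau, eps=1e-12):
    """Harmonic mean of curvature and |torsion| (the |.| is an assumption)."""
    abs_tau = np.abs(tau)
    return 2.0 * kappa * abs_tau / np.maximum(kappa + abs_tau, eps)


def local_maxima(values):
    """Indices of interior samples larger than their left neighbour and
    at least as large as their right neighbour."""
    v = np.asarray(values)
    interior = (v[1:-1] > v[:-2]) & (v[1:-1] >= v[2:])
    return np.flatnonzero(interior) + 1


def keyframe_candidates(traj, dt=1.0, planar=False):
    """Candidate keyframe indices: curvature maxima for planar motion,
    maxima of the harmonic mean of curvature and torsion otherwise."""
    kappa, tau = curvature_torsion(traj, dt)
    score = kappa if planar else informative_function(kappa, tau)
    return local_maxima(score)
```

A planar trajectory has zero torsion, so the harmonic mean vanishes everywhere and carries no information; this is one way to see why the planar case falls back to curvature alone. How the 2-D and 3-D cases are distinguished in practice is not stated here, which is why the sketch leaves that decision to the caller.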

Paper Structure (14 sections, 27 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: The consecutive frames from rest to rest position corresponding to the sign "$\kappa\alpha\lambda\eta\mu\acute{\epsilon}\rho\alpha$" (good morning)
  • Figure 2: The proposed overall summarization framework (dotted box). Its output is used for the objective measures evaluation (Section \ref{subsec:Objective}), by comparing it with the ground truth keyframes, and for the creation of the database for human-based evaluation (Section \ref{subsec:HumanBased}) and gloss classification (Section \ref{subsec:GlossClassification})
  • Figure 3: Obtained results in terms of (a, c) $F_2$ score, (b, d) recall rate, and (e) the relative mean captured sign complexity metric $C_s$. Panels (a, b) are plotted versus $R_c$ for $\Delta=5$; panels (c, d) are plotted versus the temporal proximity threshold $\Delta$ for $R_c=1, 2$
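
The $F_2$ score and recall rate in Figure 3 depend on a temporal proximity threshold $\Delta$. The sketch below shows one plausible way to compute such scores. It assumes that a detected keyframe counts as correct when an unmatched ground-truth keyframe lies within $\Delta$ frames, uses a greedy one-to-one matching, and applies the standard definition $F_\beta = (1+\beta^2) P R / (\beta^2 P + R)$. The matching protocol and the function names are assumptions for illustration; the paper's exact evaluation procedure is not given in this excerpt.

```python
def match_keyframes(predicted, ground_truth, delta):
    """Count predictions that fall within `delta` frames of a ground-truth
    keyframe, using each ground-truth keyframe at most once (greedy)."""
    unmatched = sorted(ground_truth)
    hits = 0
    for p in sorted(predicted):
        candidates = [g for g in unmatched if abs(g - p) <= delta]
        if candidates:
            # Pair the prediction with its nearest unmatched annotation.
            best = min(candidates, key=lambda g: abs(g - p))
            unmatched.remove(best)
            hits += 1
    return hits


def f_beta(predicted, ground_truth, delta, beta=2.0):
    """Return (precision, recall, F_beta) under the tolerance `delta`."""
    hits = match_keyframes(predicted, ground_truth, delta)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta**2
    score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, score
```

For example, with made-up frame indices `predicted = [10, 30, 50, 90]`, `ground_truth = [12, 33, 70]`, and `delta = 5`, two predictions are matched, giving precision 0.5, recall 2/3, and $F_2 = 0.625$. With $\beta = 2$ the score weights recall more heavily than precision, so missed keyframes are penalized more than spurious ones.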