
Zero-Shot Underwater Gesture Recognition

Sandipan Sarma, Gundameedi Sai Ram Mohan, Hariansh Sehgal, Arijit Sur

Abstract

Hand gesture recognition allows humans to interact with machines non-verbally, which has important applications in underwater exploration using autonomous underwater vehicles. Recently, a new gesture-based language called CADDIAN has been devised for divers, and supervised learning methods have been applied to recognize its gestures with high accuracy. However, such methods fail when they encounter unseen gestures in real time. In this work, we advocate the need for zero-shot underwater gesture recognition (ZSUGR), where the objective is to train a model with visual samples of gestures from a few "seen" classes only and to transfer the gained knowledge at test time to recognize semantically similar unseen gesture classes as well. After discussing the problem and dataset-specific challenges, we propose new seen-unseen splits for gesture classes in the CADDY dataset. Then, we present a two-stage framework, where a novel transformer learns strong visual gesture cues and feeds them to a conditional generative adversarial network that learns to mimic the feature distribution. We use the trained generator as a feature synthesizer for unseen classes, enabling zero-shot learning. Extensive experiments demonstrate that our method outperforms existing zero-shot techniques. We conclude by providing useful insights into our framework and suggesting directions for future research.
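The data flow of the second stage, in which a trained generator synthesizes features for unseen classes that are then used to build a classifier, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_generator` is a hypothetical stand-in that merely perturbs the semantic vector with noise (the framework described above uses a learned conditional GAN), the class names and semantic vectors are invented, and the nearest-centroid classifier is a simplification rather than the classifier used in the paper.

```python
import random


def toy_generator(z, a, noise_scale=0.05):
    """Hypothetical stand-in for a trained conditional generator G(z, a).

    It perturbs the semantic vector a with scaled noise z. A real generator
    would be a learned network producing visual features.
    """
    return [a_i + noise_scale * z_i for a_i, z_i in zip(a, z)]


def synthesize(semantic_vectors, n_per_class, rng):
    """Sample n_per_class synthetic features for every class label."""
    features = {}
    for label, a in semantic_vectors.items():
        features[label] = [
            toy_generator([rng.gauss(0.0, 1.0) for _ in a], a)
            for _ in range(n_per_class)
        ]
    return features


def centroids(features):
    """Average the synthetic features of each class."""
    result = {}
    for label, feats in features.items():
        dim = len(feats[0])
        result[label] = [sum(f[i] for f in feats) / len(feats) for i in range(dim)]
    return result


def classify(x, class_centroids):
    """Assign x to the class whose centroid is nearest in squared distance."""
    def sq_dist(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v))
    return min(class_centroids, key=lambda label: sq_dist(x, class_centroids[label]))


rng = random.Random(0)
# Invented unseen classes with invented semantic vectors (illustration only).
unseen = {"gesture_A": [1.0, 0.0, 0.0], "gesture_B": [0.0, 1.0, 0.0]}
synthetic = synthesize(unseen, n_per_class=50, rng=rng)
unseen_centroids = centroids(synthetic)
```

A test feature is then labeled with `classify(feature, unseen_centroids)`. The point of the sketch is only that no real visual samples of the unseen classes are needed once a generator conditioned on semantic vectors is available.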

Paper Structure (24 sections, 9 equations, 6 figures, 4 tables)


Figures (6)

  • Figure 1: Properties of the CADDY dataset [gomez2019caddy].
  • Figure 2: Proposed two-stage framework for ZSUGR. Here, $z$ denotes a random noise vector, and $a$ denotes the semantic vector of a gesture class.
  • Figure 3: Architecture of our novel GCAT gesture decoder.
  • Figure 4: Comparison of our GZSL confusion matrix with the state-of-the-art.
  • Figure 5: Component analysis of our framework for the three proposed random splits. Top-1 accuracy is reported in CZSL ($\boldsymbol{U_{czsl}}$) and GZSL settings ($\boldsymbol{S_{gzsl}}$ and $\boldsymbol{U_{gzsl}}$ with harmonic mean $\boldsymbol{H}$). E = GCAT encoder, RN-101 = ResNet-101 as feature extractor, Dec = GCAT decoder, c-WGAN = Conditional WGAN.
  • ...and 1 more figure
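
The Figure 5 caption reports a harmonic mean $\boldsymbol{H}$ of the seen and unseen GZSL accuracies. The formula is not stated in this listing. Assuming the definition commonly used in generalized zero-shot learning, $H = 2SU / (S + U)$, it can be computed as in the minimal sketch below. The function name and the sample accuracies are illustrative and are not taken from the paper's results.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of seen and unseen top-1 accuracies (assumed GZSL definition)."""
    total = seen_acc + unseen_acc
    if total == 0:
        # Both accuracies are zero; define H as zero to avoid division by zero.
        return 0.0
    return 2 * seen_acc * unseen_acc / total


# Illustrative values only: a model biased towards seen classes scores a low H.
biased = harmonic_mean(0.8, 0.2)
balanced = harmonic_mean(0.5, 0.5)
```

Because the harmonic mean is dominated by the smaller of the two values, a model that performs well on seen classes but poorly on unseen ones receives a lower score than a balanced model with the same arithmetic average.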