RingGesture: A Ring-Based Mid-Air Gesture Typing System Powered by a Deep-Learning Word Prediction Framework

Junxiao Shen, Roger Boldu, Arpit Kalla, Michael Glueck, Hemant Bhaskar Surale, Amy Karlson

TL;DR

RingGesture is a ring-based mid-air gesture typing technique that uses electrodes to mark the start and end of gesture trajectories and inertial measurement unit (IMU) sensors for hand tracking, offering an intuitive experience similar to the raycast-based mid-air gesture typing found in VR headsets.

Abstract

Text entry is a critical capability for any modern computing experience, and lightweight augmented reality (AR) glasses are no exception. Because lightweight AR glasses are designed for all-day wearability, they are restricted in how many cameras they can include to provide an extensive field of view for hand tracking. This constraint underscores the need for an additional input device. We propose a system to address this gap: RingGesture, a ring-based mid-air gesture typing technique that uses electrodes to mark the start and end of gesture trajectories and inertial measurement unit (IMU) sensors for hand tracking. This method offers an intuitive experience similar to the raycast-based mid-air gesture typing found in VR headsets, allowing hand movements to translate seamlessly into cursor navigation. To enhance both accuracy and input speed, we propose a novel deep-learning word prediction framework, Score Fusion, composed of three key components: a) a word-gesture decoding model, b) a spatial spelling correction model, and c) a lightweight contextual language model. In contrast to a conventional framework that applies such models sequentially, Score Fusion fuses the scores from the three models to predict the most likely words with higher precision. We conduct comparative and longitudinal studies that demonstrate two key findings. First, RingGesture is effective overall, achieving an average text entry speed of 27.3 words per minute (WPM) and a peak of 47.9 WPM. Second, the Score Fusion framework outperforms a conventional word prediction framework, Naive Correction, with a 28.2% improvement in uncorrected Character Error Rate, leading to a 55.2% improvement in text entry speed for RingGesture. Additionally, RingGesture received a System Usability Scale (SUS) score of 83, signifying excellent usability.
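
The abstract says only that Score Fusion fuses the scores of its three models; it does not give the fusion rule. As a minimal sketch of what such a fusion could look like, the Python below ranks candidate words by a weighted sum of log-probabilities. The function name `fuse_scores`, the weights, the probability floor, and the toy scores are all assumptions made for illustration, not the paper's method or data.

```python
import math

def fuse_scores(decoder, spelling, language, weights=(1.0, 1.0, 1.0), floor=1e-9):
    """Rank candidate words by a weighted sum of log-probabilities.

    Each argument maps word -> probability from one (hypothetical) model.
    The weighted log-linear rule is an assumed fusion rule, not the paper's.
    """
    w_dec, w_spell, w_lm = weights
    candidates = set(decoder) | set(spelling) | set(language)
    fused = {}
    for word in candidates:
        # A word missing from a model gets a small floor probability
        # so that the logarithm stays defined.
        fused[word] = (
            w_dec * math.log(decoder.get(word, floor))
            + w_spell * math.log(spelling.get(word, floor))
            + w_lm * math.log(language.get(word, floor))
        )
    return sorted(fused, key=fused.get, reverse=True)

# Toy scores (invented): the decoder alone slightly prefers "help".
decoder = {"hello": 0.40, "help": 0.45, "hells": 0.15}
spelling = {"hello": 0.50, "help": 0.30, "hells": 0.20}
language = {"hello": 0.60, "help": 0.35, "hells": 0.05}

ranking = fuse_scores(decoder, spelling, language)
print(ranking)
```

In this toy case the decoder on its own would rank "help" first, but the fused ranking puts "hello" first because the other two models favor it. That is the kind of joint decision a fusion framework allows and a strictly sequential pipeline does not.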

Paper Structure

This paper contains 25 sections, 2 equations, 11 figures, 4 tables, 1 algorithm.

Figures (11)

  • Figure 1: 2D cursor control from arm and wrist. The cursor controlled by a ring is a direct mapping from $\theta$, and the cursor controlled by a wristband is a direct mapping from $\phi$.
  • Figure 2: The conventional word prediction framework, Naive Correction, operates sequentially: the predictions from the word-gesture decoding model are first corrected for misspellings by the edit-distance-based spelling correction model, which calculates the edit distance to the word candidates in the corpus. The corrected candidates are then re-ranked by an N-gram language model (Jurafsky and Martin, 2019) based on the previously entered N-1 words.
  • Figure 3: Violin plots of answers to subjective rating questions scored on 5-point Likert scales. Violin plots are modified box plots that add estimated kernel density plots to the summary statistics displayed by box plots. The 5-point Likert scales ranged from 1 (strongly disagree) to 5 (strongly agree).
  • Figure 4: Our novel deep-learning word prediction framework, Score Fusion, operates as an integrated fusion process that evaluates each word suggestion by considering its initial decoding score, its likelihood of being a spatial spelling correction, and its contextual relevance. The resulting blended score aims to ensure that the final suggestions are derived from an accurate word-gesture decoding model while also being enhanced for typographical precision, keyboard-layout awareness, and contextual relevance.
  • Figure 5: The novel process of converting word-gesture trajectory data from one keyboard layout to another. It demonstrates an example of a trajectory (for the word 'available') that undergoes both temporal and spatial transformations. The keyboard layouts vary in terms of key spacing, bottom row shifts, and the presence or absence of the apostrophe key.
  • ...and 6 more figures
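
The sequential pipeline described in the Figure 2 caption (decode, correct by edit distance, re-rank by N-gram) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the tiny vocabulary, the bigram counts (N = 2), and the decoder outputs are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def correct(word, vocabulary):
    """Snap a decoded string to the closest vocabulary word (ties: alphabetical)."""
    return min(vocabulary, key=lambda v: (edit_distance(word, v), v))

def naive_correction(decoded, previous_word, vocabulary, bigram_counts):
    """Step 1: spelling-correct each decoder output. Step 2: re-rank by bigram count."""
    corrected = []
    for word in decoded:
        fixed = correct(word, vocabulary)
        if fixed not in corrected:
            corrected.append(fixed)
    return sorted(corrected,
                  key=lambda w: bigram_counts.get((previous_word, w), 0),
                  reverse=True)

# Hypothetical data: the decoder output is misspelled and ranks "movng" first.
vocabulary = ["morning", "mourning", "moving"]
bigram_counts = {("good", "morning"): 50, ("good", "moving"): 2}
decoded = ["movng", "mornign"]

result = naive_correction(decoded, "good", vocabulary, bigram_counts)
print(result)
```

Each stage here sees only the output of the stage before it, so the decoder's confidence is discarded once spelling correction has run. That loss of information is the limitation the Score Fusion framework in Figure 4 is described as addressing.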