Finger in Camera Speaks Everything: Unconstrained Air-Writing for Real-World

Meiqi Wu; Kaiqi Huang; Yuanqiang Cai; Shiyu Hu; Yuzhong Zhao; Weiqiang Wang

TL;DR

This work introduces AWCV-100K-UCAS2024, the first large-scale, video-based, logogram-focused air-writing dataset captured with general RGB cameras, comprising 8.8 million frames that cover all 3,755 GB1 Chinese characters. To cope with the sparse visual cues in real-world video, it proposes VCRec, a two-stage model that first extracts fingertip features from those cues and then applies a spatio-temporal sequence module with StrokeGAT to capture both temporal dynamics and character structure. On AWCV-100K-UCAS2024, VCRec outperforms existing video-based air-writing methods (e.g., a 4.92% absolute improvement) and remains robust across diverse environments, hand sizes, and lighting conditions. The dataset and baseline code are intended to accelerate research and to enable practical air-writing interfaces on everyday devices such as laptops and smartphones, advancing natural, hands-free human-computer interaction in real-world settings.
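The two-stage data flow described above can be illustrated with a toy sketch. This is not the authors' VCRec: it assumes per-frame fingertip coordinates are already available, and it swaps in a plain nearest-template matcher for the spatio-temporal sequence module (StrokeGAT is not modelled at all). The function names `resample`, `fingertip_features`, and `recognise`, and the two template tracks, are hypothetical and exist only for illustration.

```python
from math import hypot

def resample(track, n):
    """Linearly interpolate a fingertip track (list of (x, y)) to n points."""
    last = len(track) - 1
    points = []
    for i in range(n):
        t = i * last / (n - 1)
        k = min(int(t), last - 1)      # segment index, clamped at the final segment
        f = t - k                      # position inside that segment
        (x0, y0), (x1, y1) = track[k], track[k + 1]
        points.append((x0 + f * (x1 - x0), y0 + f * (y1 - y0)))
    return points

def fingertip_features(track, n=8):
    """Stage 1 stand-in: scale-normalised frame-to-frame displacements."""
    pts = resample(track, n)
    deltas = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    length = sum(hypot(dx, dy) for dx, dy in deltas) or 1.0
    return [(dx / length, dy / length) for dx, dy in deltas]

def recognise(track, templates, n=8):
    """Stage 2 stand-in: return the label of the nearest template sequence."""
    query = fingertip_features(track, n)

    def distance(label):
        ref = fingertip_features(templates[label], n)
        return sum(hypot(qx - rx, qy - ry) for (qx, qy), (rx, ry) in zip(query, ref))

    return min(templates, key=distance)

# Hypothetical single-stroke templates (not taken from the dataset).
templates = {
    "horizontal": [(0, 0), (1, 0), (2, 0)],
    "vertical": [(0, 0), (0, 1), (0, 2)],
}
# A slightly wobbly horizontal stroke at a different scale and frame count.
query = [(10, 5), (14, 5), (18, 5.2), (22, 5), (26, 5)]
print(recognise(query, templates))
```

Because stage 1 normalises by path length and resamples to a fixed number of points, the matcher is insensitive to writing scale and frame count, which is the kind of variation (hand size, writing speed) that the dataset is described as covering.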

Abstract

Air-writing is a challenging task that combines the fields of computer vision and natural language processing, offering an intuitive and natural approach for human-computer interaction. However, current air-writing solutions face two primary challenges: (1) their dependency on complex sensors (e.g., Radar, EEGs and others) for capturing precise handwritten trajectories, and (2) the absence of a video-based air-writing dataset that covers a comprehensive vocabulary range. These limitations impede their practicality in various real-world scenarios, including the use on devices like iPhones and laptops. To tackle these challenges, we present the groundbreaking air-writing Chinese character video dataset (AWCV-100K-UCAS2024), serving as a pioneering benchmark for video-based air-writing. This dataset captures handwritten trajectories in various real-world scenarios using commonly accessible RGB cameras, eliminating the need for complex sensors. AWCV-100K-UCAS2024 includes 8.8 million video frames, encompassing the complete set of 3,755 characters from the GB2312-80 level-1 set (GB1). Furthermore, we introduce our baseline approach, the video-based character recognizer (VCRec). VCRec adeptly extracts fingertip features from sparse visual cues and employs a spatio-temporal sequence module for analysis. Experimental results showcase the superior performance of VCRec compared to existing models in recognizing air-written characters, both quantitatively and qualitatively. This breakthrough paves the way for enhanced human-computer interaction in real-world contexts. Moreover, our approach leverages affordable RGB cameras, enabling its applicability in a diverse range of scenarios. The code and data examples will be made public at https://github.com/wmeiqi/AWCV.
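As a rough sense of scale, the figures quoted in the abstract can be turned into per-character and per-video averages. The frame and character counts come from the abstract; the video count of 100,000 is an assumption read off the "100K" in the dataset name and is not stated in the text here, so the last two numbers are estimates only.

```python
TOTAL_FRAMES = 8_800_000    # "8.8 million video frames" (abstract)
NUM_CHARACTERS = 3_755      # GB2312-80 level-1 set, GB1 (abstract)
ASSUMED_VIDEOS = 100_000    # ASSUMPTION: inferred from "100K" in the dataset name

frames_per_character = TOTAL_FRAMES / NUM_CHARACTERS
frames_per_video = TOTAL_FRAMES / ASSUMED_VIDEOS
videos_per_character = ASSUMED_VIDEOS / NUM_CHARACTERS

print(f"frames per character: {frames_per_character:.1f}")   # 2343.5
print(f"frames per video:     {frames_per_video:.1f}")       # 88.0 (under the assumption)
print(f"videos per character: {videos_per_character:.1f}")   # 26.6 (under the assumption)
```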

Paper Structure (22 sections, 4 equations, 12 figures, 11 tables)

Introduction
Related Works
Air-Writing Datasets
Air-Writing Recognition Models
Fingertip Detection and Tracking
AWCV-100K-UCAS2024 Dataset
Data Collection
Checkout Flow
Challenge Attributes
Dataset Comparison
Evaluation Protocol
Methodology
Overview
Fingertip Feature Extractor
Spatio-Temporal Sequence Module
...and 7 more sections

Figures (12)

Figure 1: Comparing Our Work and Conventional Air-Writing in Real-World Scenarios. Conventional air-writing relies on handwritten trajectories captured accurately by complex sensors (such as Radar [9820766], Smart Watch [10150241], Leap Motion [gan2018unified], EEGs [tripathi2023neuroair], IMU [zhang2022wearable]), which imposes significant limitations in real-world scenarios (e.g., VR/AR/MR, iPhone, metaverse, GPT series [brown2020language] and others). Mainstream real-world devices incorporate only standard RGB cameras and require coverage of commonly used words for communication purposes. To address these challenges, we propose AWCV-100K-UCAS2024, a video-based air-writing dataset with a comprehensive corpus (covering 99.7% of daily-used characters) captured by general cameras, and VCRec, a recognizer designed for sparse visual features and complex character structures.
Figure 2: Comparison of AWCV-100K-UCAS2024 with other benchmarks. Phonogram-based (e.g., VBFR [jin2007novel], VBHR [schick2012vision], AWR [chen2015air], WiFi [fu2018writing], FDT [mukherjee2019fingertip], WiTA [kim2022writing]) and logogram-based benchmarks are selected for an overall comparison. The bubble diameter is proportional to the total number of frames in the benchmark, and the vertical axis represents the coverage rate of daily-used characters in each benchmark. The proposed AWCV-100K-UCAS2024 is the first logogram-based video dataset with a comprehensive corpus, more participants, and more complex characters.
Figure 3: Examples of AWCV-100K-UCAS2024. The figure compares data captured under different lighting intensities and backgrounds. The left side shows video frames from the dataset, and the right side shows the corresponding labels. The blue circles mark overexposure caused by strong illumination, and the red box marks motion blur. (TOP) Data collected under complex backgrounds and strong lighting conditions. (BOTTOM) Data collected under simpler backgrounds and weaker lighting conditions.
Figure 4: More Complex and Comprehensive Corpus. The figure shows the characteristics of AWCV-100K-UCAS2024. (A) Phonograms are composed of letters and are easy to identify (e.g., "ship" consists of the letters "s, h, i, p"). In logograms, one pinyin corresponds to multiple characters. Each character is made up of parts, which can be further divided into strokes (e.g., pinyin "chuan", with yellow numbers representing the stroke order). (B) Successive frames represent a stroke (e.g., the first two frames of the video correspond to the first stroke in the figure); they are collected by a general camera under a cluttered background and natural light. (C) The stroke distributions of GB1 and AWCV-100K-UCAS2024, respectively.
Figure 5: Statistical Analysis of Environments in AWCV-100K-UCAS2024. (A) The diverse environmental distributions within AWCV-100K-UCAS2024. (B) The environmental combinations of two backgrounds (i.e., neat background and cluttered background) and three types of light (i.e., natural light, artificial light, and their hybrids).
...and 7 more figures
