AuthGlass: Benchmarking Voice Liveness Detection and Authentication on Smart Glasses via Comprehensive Acoustic Features

Weiye Xu, Zhang Jiang, Siqi Zheng, Xiyuxing Zhang, Changhao Zhang, Jian Liu, Weiqiang Wang, Yuntao Wang

TL;DR

AuthGlass addresses the security gap in voice interaction on smart glasses by introducing a public, high-resolution, multi-channel dataset and hardware platform. It presents AuthG-Live, a sound-field-based liveness detector, and AuthG-Net, a multi-acoustic-modal authentication model that fuses air-conducted (AC), bone-conducted (BC), and sound-field (SF) cues for user verification. The approach achieves state-of-the-art performance on four benchmark tasks and generalizes to unseen attacks and cross-utterance scenarios; ablations show that it remains effective with fewer modalities and with commercial-device microphone configurations. The work also offers practical design guidance on microphone layout and releases open data and hardware resources for future research.

Abstract

With the rapid advancement of smart glasses, voice interaction has been widely adopted due to its naturalness and convenience. However, its practical deployment is often undermined by vulnerability to spoofing attacks, while no public dataset currently exists for voice liveness detection and authentication in smart-glasses scenarios. To address this challenge, we first collect a multi-acoustic-modal dataset comprising 16-channel audio data from 42 subjects, along with corresponding attack samples covering two attack categories. Based on insights derived from this collected data, we propose AuthG-Live, a sound-field-based voice liveness detection method, and AuthG-Net, a multi-acoustic-modal authentication model. We further benchmark seven voice liveness detection methods and four authentication methods across diverse acoustic modalities. The results demonstrate that our proposed approach achieves state-of-the-art performance on four benchmark tasks, and extensive ablation studies validate the generalizability of our methods across different modality combinations. Finally, we release this dataset, termed AuthGlass, to facilitate future research on voice liveness detection and authentication for smart glasses.

Paper Structure

This paper contains 64 sections, 1 equation, 10 figures, 18 tables.

Figures (10)

  • Figure 1: Human speech production system diagram. (a) Structure of the vocal tract, including the vocal cords, tongue, and other articulatory organs. (b) Acoustic features related to speech production, including vibrations conducted through skin and bones, as well as the spatial propagation effects of the sound field such as time delays and energy attenuation.
  • Figure 2: Air-conducted (AC) and bone-conducted (BC) microphone signals used for feature selection. (a) The raw audio signals and corresponding Mel-spectrograms captured by AC and BC microphones when a user speaks the phrase “check my voicemail.” The utterance is segmented into phonemes. The results show that both AC and BC microphones capture rich time–frequency characteristics. (b) Comparison of AC and BC signals from different users uttering “check my voicemail,” visualized using a consistent color scheme. As highlighted by the blue boxes, AC and BC signals from the same user exhibit subtle frequency-related differences, whereas signals captured by either AC or BC microphones show pronounced inter-user variability.
  • Figure 3: This figure illustrates the computation of Energy Ratio over Time (ERT) and Time Delay over Time (TDT), along with the spatial acoustic information they capture. (a) Energy attenuation (red dashed line) and time delay (green dashed line) between a selected AC channel and the central AC channel within a short time frame. Both energy attenuation and time delay vary across different AC channels. (b) Normalized energy ratio and time delay distributions across all time frames when a user speaks the phrase “check my voicemail”, forming the ERT and TDT. The left panel shows the averaged energy ratio and time delay distributions of different users, with inter-user variations highlighted by colored regions.
  • Figure 4: This figure illustrates the computation of Energy Distribution over Frequencies (EDF) and the spatial acoustic information it captures. (a) STFT of signals recorded by microphones at different positions on the glasses. (b) For each microphone, the STFT is averaged along the time axis to obtain the energy–frequency representation. On the right, EDFs for different users uttering the phrase “check my voicemail” are shown, highlighting inter-user variations.
  • Figure 5: Workflow of enrollment, liveness detection, and authentication on smart glasses. (a) The yellow path: the enrollment process. After enrollment, an embedding is extracted and stored in the user database as a template for later comparison. (b) The green path: a successful verification, where both liveness detection and authentication are passed. (c) The pink path: the user is not wearing the glasses and is rejected by liveness detection. (d) The red path: another user wears the glasses and attempts verification; they pass liveness detection but fail authentication. (e) The orange path: an attacker performs a replay or other spoofing attack, and the attempt is rejected by liveness detection.
  • ...and 5 more figures
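
The Figure 3 caption describes two per-frame quantities between a selected AC channel and the central AC channel: an energy ratio and a time delay, tracked across time frames to form ERT and TDT. The paper's exact computation is not given in this excerpt, so the following is only a minimal sketch under stated assumptions: the function names, frame length, hop size, and lag search range are all hypothetical, the delay is estimated by a plain cross-correlation peak, and the normalization mentioned in the caption is omitted.

```python
import math
import random


def frame_signal(x, frame_len, hop):
    """Split a sequence into overlapping frames (the incomplete tail is dropped)."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]


def energy(frame):
    """Sum of squared samples in one frame."""
    return sum(s * s for s in frame)


def best_lag(ref, sig, max_lag):
    """Lag in samples that maximizes the cross-correlation of ref and sig.

    A positive result means sig is delayed relative to ref.
    """
    n = len(ref)
    best, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += ref[i] * sig[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def ert_tdt(ref_channel, other_channel, frame_len=64, hop=32, max_lag=8, eps=1e-12):
    """Per-frame energy ratio and time delay of other_channel against ref_channel.

    Assumption: this is an illustrative stand-in, not the paper's definition.
    """
    ref_frames = frame_signal(ref_channel, frame_len, hop)
    oth_frames = frame_signal(other_channel, frame_len, hop)
    pairs = list(zip(ref_frames, oth_frames))
    ert = [energy(o) / (energy(r) + eps) for r, o in pairs]
    tdt = [best_lag(r, o, max_lag) for r, o in pairs]
    return ert, tdt


# Toy data (hypothetical): a side channel that is the central channel
# attenuated by half in amplitude and delayed by three samples.
rng = random.Random(0)
center = [rng.uniform(-1.0, 1.0) for _ in range(512)]
delay = 3
side = [0.0] * delay + [0.5 * s for s in center[:-delay]]

ert, tdt = ert_tdt(center, side)
```

On this toy input each frame's delay estimate recovers the three-sample shift, and each energy ratio sits near the squared amplitude factor. Real multi-channel recordings would need a sampling-rate-aware lag range and some handling of silent frames, which this sketch ignores.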
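
The Figure 5 caption describes a two-stage gate: liveness detection first, then authentication against an enrolled template. The sketch below illustrates only that control flow. Everything in it is an assumption for illustration: the `Verifier` class, the callables standing in for the liveness detector and embedding extractor, the cosine-similarity comparison, and the threshold value are hypothetical and are not taken from AuthG-Live or AuthG-Net.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class Verifier:
    """Hypothetical enrollment and verification flow with a liveness gate."""

    def __init__(self, is_live: Callable, embed: Callable, threshold: float = 0.8):
        self.is_live = is_live      # stand-in for a liveness detector
        self.embed = embed          # stand-in for an embedding extractor
        self.threshold = threshold  # assumed decision threshold
        self.templates: Dict[str, List[float]] = {}

    def enroll(self, user_id: str, audio) -> None:
        """Store the embedding of the enrollment audio as the user's template."""
        self.templates[user_id] = list(self.embed(audio))

    def verify(self, user_id: str, audio) -> str:
        """Run liveness detection first, then compare against the template."""
        if not self.is_live(audio):
            return "rejected: liveness"
        template = self.templates.get(user_id)
        if template is None:
            return "rejected: not enrolled"
        score = cosine_similarity(template, self.embed(audio))
        if score >= self.threshold:
            return "accepted"
        return "rejected: authentication"


# Toy stand-ins: each "audio" is a dict carrying a liveness flag and features.
verifier = Verifier(is_live=lambda a: a["live"], embed=lambda a: a["features"])
verifier.enroll("alice", {"live": True, "features": [1.0, 0.0]})

genuine = verifier.verify("alice", {"live": True, "features": [0.9, 0.1]})
impostor = verifier.verify("alice", {"live": True, "features": [0.0, 1.0]})
replay = verifier.verify("alice", {"live": False, "features": [1.0, 0.0]})
```

The three calls mirror the green, red, and orange paths of the figure: a genuine wearer passes both stages, a different wearer passes liveness but fails the template comparison, and a replayed signal is stopped at the liveness stage before any comparison takes place.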