Poster: Recognizing Hidden-in-the-Ear Private Key for Reliable Silent Speech Interface Using Multi-Task Learning

Xuefu Dong, Liqiang Xu, Lixing He, Zengyi Han, Ken Christofferson, Yifei Chen, Akihito Taya, Yuuki Nishiyama, Kaoru Sezaki

TL;DR

HEar-ID addresses privacy-preserving silent speech interfaces by jointly authenticating the user and decoding silent spellings from in-ear audio signals. It introduces a CLWUM-based contrastive learning framework within a multi-task architecture that aligns genuine whisper-ultrasonic pairs while enabling word-level spelling via a CTC decoder. Preliminary results with 11 participants show robust authentication (a false positive rate, FPR, of around 3%) and competitive spelling performance, with a mean Top-1 accuracy of 67.3% and up to 90.25% for eight users. The work demonstrates the feasibility of secure, hands-free interaction on consumer earbud hardware by coupling private-key-style embeddings with silent-speech interfaces.
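To illustrate what word-level spelling via a CTC decoder involves, here is a minimal sketch of greedy CTC decoding in plain Python. It is not the authors' implementation: the summary does not say which decoding strategy HEar-ID uses, so greedy decoding, the blank symbol `_`, the lowercase alphabet, and the helper `one_hot_frames` are all assumptions made for illustration.

```python
# Hypothetical symbol set: the source does not specify the blank token or alphabet.
BLANK = "_"
ALPHABET = [BLANK] + list("abcdefghijklmnopqrstuvwxyz")


def ctc_greedy_decode(frame_scores):
    """Turn per-frame symbol scores into a letter string with greedy CTC rules."""
    # Step 1: take the best-scoring symbol index in every frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]

    letters = []
    prev = None
    for idx in best:
        # Step 2: merge consecutive repeats. Step 3: drop blank symbols.
        if idx != prev and ALPHABET[idx] != BLANK:
            letters.append(ALPHABET[idx])
        prev = idx
    return "".join(letters)


def one_hot_frames(symbols):
    """Build toy frame scores (1.0 for the chosen symbol) for demonstration only."""
    return [[1.0 if s == a else 0.0 for a in ALPHABET] for s in symbols]


if __name__ == "__main__":
    # A blank between the two "l" runs keeps the double letter after collapsing.
    print(ctc_greedy_decode(one_hot_frames("hh_e_ll_l_oo")))
```

The blank symbol is what lets CTC emit repeated letters: without the `_` between the two `l` runs, the repeats would merge into a single letter.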

Abstract

Silent speech interfaces (SSIs) enable hands-free input without audible vocalization, but most SSI systems do not verify speaker identity. We present HEar-ID, which uses consumer active noise-canceling earbuds to capture low-frequency "whisper" audio and high-frequency ultrasonic reflections. Features from both streams pass through a shared encoder, producing embeddings that feed a contrastive branch for user authentication and an SSI head for silent spelling recognition. This design supports decoding of 50 words while reliably rejecting impostors, all on commodity earbuds with a single model. Experiments demonstrate that HEar-ID achieves strong spelling accuracy and robust authentication.
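As a rough illustration of how embeddings from a contrastive branch could be used to reject impostors, the sketch below compares an embedding against an enrolled template by cosine similarity and measures the false positive rate over impostor attempts. The abstract does not describe HEar-ID's actual decision rule, so the cosine-similarity test, the threshold of 0.8, the function names, and the toy vectors are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def accept(embedding, template, threshold=0.8):
    """Accept the attempt if it is close enough to the enrolled template.

    The threshold is a hypothetical value, not one reported in the source.
    """
    return cosine_similarity(embedding, template) >= threshold


def false_positive_rate(impostor_embeddings, template, threshold=0.8):
    """Fraction of impostor attempts that are wrongly accepted."""
    if not impostor_embeddings:
        return 0.0
    accepted = sum(accept(e, template, threshold) for e in impostor_embeddings)
    return accepted / len(impostor_embeddings)


if __name__ == "__main__":
    template = [1.0, 0.0]  # toy enrolled embedding
    impostors = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0], [0.5, 0.5]]  # toy attempts
    print(false_positive_rate(impostors, template))
```

Raising the threshold lowers the false positive rate at the cost of rejecting more genuine attempts, so in practice it would be tuned on held-out data.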

Paper Structure

This paper contains 16 sections, 5 equations, 2 figures.

Figures (2)

  • Figure 1: The three-fold workflow of HEar-ID.
  • Figure 2: (a) Word inference accuracy, and (b) user authentication performance.