ActSonic: Recognizing Everyday Activities from Inaudible Acoustic Wave Around the Body

Saif Mahmud; Vineet Parikh; Qikang Liang; Ke Li; Ruidong Zhang; Ashwin Ajit; Vipin Gunda; Devansh Agarwal; François Guimbretière; Cheng Zhang

TL;DR

ActSonic is an intelligent, low-power active acoustic sensing system integrated into eyeglasses that recognizes 27 different everyday activities from inaudible acoustic waves around the body.

Abstract

We present ActSonic, an intelligent, low-power active acoustic sensing system integrated into eyeglasses that can recognize 27 different everyday activities (e.g., eating, drinking, toothbrushing) from inaudible acoustic waves around the body. It requires only a pair of miniature speakers and microphones mounted on each hinge of the eyeglasses to emit ultrasonic waves, creating an acoustic aura around the body. The acoustic signals are reflected based on the position and motion of various body parts, captured by the microphones, and analyzed by a customized self-supervised deep learning framework to infer the performed activities on a remote device such as a mobile phone or cloud server. ActSonic was evaluated in user studies with 19 participants across 19 households to track its efficacy in everyday activity recognition. Without requiring any training data from new users (leave-one-participant-out evaluation), ActSonic detected 27 activities, achieving an average F1-score of 86.6% in fully unconstrained scenarios and 93.4% in prompted settings at participants' homes.
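
The leave-one-participant-out protocol mentioned above can be illustrated with a short sketch. This is not the authors' evaluation code: the sample layout `(participant_id, features, label)` and the helper names are hypothetical, and macro averaging of the F1-score is an assumption, since the abstract only reports an "average F1-score".

```python
def leave_one_participant_out(samples):
    """Yield (held_out_id, train, test) once per participant.

    `samples` is a list of (participant_id, features, label) tuples
    (a hypothetical layout chosen for this sketch).
    """
    participants = sorted({pid for pid, _, _ in samples})
    for held_out in participants:
        # The held-out participant contributes no training data at all.
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

In such a setup, a model would be trained on `train` and scored on `test` for each fold, and the per-fold scores averaged across participants.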

Paper Structure (51 sections, 4 equations, 12 figures, 3 tables)

INTRODUCTION
RELATED WORK
IMU-based Human Activity Recognition
Vision-based and Multimodal Human Activity Recognition
Acoustic Sensing-based Human Activity Recognition
DESIGN AND IMPLEMENTATION OF SENSING SYSTEM
Configuration of Active Acoustic Signal
Computation of Echo Profile and Acoustic Flow
Hardware Implementation and Wearable Form Factor
DEEP LEARNING FRAMEWORK
Self-supervised Learning Pipeline
Pretraining Task
Fine-tuning
Training and Implementation
Evaluation Metric
...and 36 more sections

Figures (12)

Figure 1: Overview of the active acoustic sensing principle of ActSonic. The video frames (first row) serve as activity references. The $x$-axis of the echo frames (second row) represents the distance of echo reception. The echo profile, created by stacking multiple echo frames, provides a spatiotemporal representation of the activity. Sliding windows of echo profiles with a duration of $2$ seconds (third row) serve as inputs for the self-supervised learning algorithm.
Figure 2: Overview of echo profile and acoustic flow calculation. For the echo profile, we apply a bandpass filter to the received signal (so that only the frequencies of interest are retained) and cross-correlate the filtered signal with the transmitted signal. This yields the direct echo profile; acoustic flow is then calculated by taking the difference between two consecutive echo profiles.
Figure 3: Hardware of ActSonic: (a) Eyeglasses form factor, (b) Transmitter or speaker, (c) Receiver or microphone (the sensor boards in (b) and (c) measure $9\,\mathrm{mm} \times 9\,\mathrm{mm}$), (d) Front (d.1) and back (d.2) of the customized PCB (dimensions $18\,\mathrm{mm} \times 23\,\mathrm{mm}$) with a low-power nRF52840 microcontroller, (e) User wearing the ActSonic eyeglasses form factor.
Figure 4: Deep learning model architecture for ActSonic. In the self-supervised pretraining stage, we mask out specific sections of the input echo profile and train an encoder with a lightweight decoder to reconstruct the input echo profile, supervised by an MSE loss. We then fine-tune the pretrained encoder together with a lightweight classifier on the labeled dataset.
Figure 5: Distribution of participant schedules for the user study over time, where the $x$-axis represents time and the $y$-axis represents participant count. We split the participants into three general groups ("morning" as 7 am - 12 pm, "afternoon" as 12 pm - 6 pm, and "evening" as 6 pm - 11 pm) to ensure a mixture of data across different times of day.
...and 7 more figures
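
The echo profile and acoustic flow computation described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear chirp, the sampling rate (50 kHz), the frame length (600 samples), and the 18-21 kHz band are hypothetical values chosen for the demo; the FFT-mask filter stands in for whatever bandpass filter the paper uses; and "consecutive echo profiles" is interpreted here as consecutive echo frames (rows of the stacked profile). The helper names `make_chirp`, `bandpass_fft`, `echo_profile`, and `acoustic_flow` are made up for this sketch.

```python
import numpy as np


def make_chirp(n, fs, f0, f1):
    """Linear chirp of n samples sweeping f0 -> f1 Hz (assumed transmit signal)."""
    t = np.arange(n) / fs
    k = (f1 - f0) / (n / fs)  # sweep rate in Hz per second
    return np.cos(2 * np.pi * (f0 * t + 0.5 * k * t**2))


def bandpass_fft(x, fs, lo, hi):
    """Zero all FFT bins outside [lo, hi] Hz (simple stand-in for a bandpass filter)."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spec[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spec, n=len(x))


def echo_profile(tx_frame, rx, fs, lo, hi):
    """Return a (num_frames, frame_len) array of stacked echo frames.

    Each row is the circular cross-correlation of one filtered received
    frame with the transmitted frame; the column index is the echo delay
    in samples, which maps to distance.
    """
    n = len(tx_frame)
    rx = bandpass_fft(rx, fs, lo, hi)
    num_frames = len(rx) // n
    frames = rx[: num_frames * n].reshape(num_frames, n)
    tx_spec_conj = np.conj(np.fft.rfft(tx_frame))
    return np.fft.irfft(np.fft.rfft(frames, axis=1) * tx_spec_conj, n=n, axis=1)


def acoustic_flow(profile):
    """Difference between consecutive echo frames (rows) of the profile."""
    return np.diff(profile, axis=0)


# Demo with a synthetic reflector that moves from a 25-sample to a 40-sample delay.
fs, n = 50_000, 600
tx = make_chirp(n, fs, 18_000, 21_000)
rx = np.concatenate([np.roll(tx, d) for d in (25, 25, 40, 40)])
profile = echo_profile(tx, rx, fs, 18_000, 21_000)
flow = acoustic_flow(profile)
```

In this toy setup the flow is close to zero while the reflector is static and becomes large on the frame where it moves, which is the property that makes the differenced representation sensitive to motion rather than to static surroundings.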
