Speech Foundation Models Generalize to Time Series Tasks from Wearable Sensor Data

Jaya Narain; Zakaria Aldeneh; Shirley Ren

TL;DR

Speech foundation models pretrained on large speech corpora generalize to time-series tasks from wearable sensors when used as frozen feature extractors with lightweight probes. Sensor signals are upsampled and passed through HuBERT and wav2vec 2.0 to produce embeddings, and simple probes map those embeddings to activity, arrhythmia, and mood labels. The study finds that the early convolutional encoder layers provide the most transferable information, and that the resulting cross-domain representations often outperform in-domain self-supervised baselines, particularly in data-scarce settings. The work is a step toward unified time-series modeling across speech and sensor modalities, with implications for efficient multimodal systems.

Abstract

Speech and sensor time series data both encode information in the time and frequency domains, such as spectral powers and waveform shapelets. We show that speech foundation models learn representations that generalize beyond the speech domain and achieve state-of-the-art performance on diverse time-series tasks from wearable sensors. Probes trained on features extracted from HuBERT and wav2vec 2.0 outperform probes trained on features from self-supervised models trained directly on modality-specific datasets, across mood classification, arrhythmia detection, and activity classification tasks. We find that the convolutional feature encoders of speech models are particularly relevant for wearable sensor applications. The proposed approach enhances performance on data-scarce time-series tasks using simple probing methods. This work takes a step toward developing generalized time-series models that unify speech and sensor modalities.
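The pipeline described above (upsample the sensor signal, extract features with a frozen encoder, train only a simple probe) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: linear interpolation is an assumed upsampling method, `frozen_features` is a hypothetical stand-in that uses a fixed bank of strided 1-D filters in place of the real HuBERT or wav2vec 2.0 convolutional encoder, and the ridge-regression probe is one simple choice of linear probe. The 16 kHz target rate is the input rate those speech models expect.

```python
import numpy as np

SPEECH_RATE_HZ = 16_000  # input sampling rate expected by HuBERT and wav2vec 2.0


def upsample(signal, src_hz, dst_hz=SPEECH_RATE_HZ):
    """Linearly interpolate a 1-D sensor signal up to the target rate (assumed method)."""
    n_out = int(round(len(signal) * dst_hz / src_hz))
    t_src = np.arange(len(signal)) / src_hz
    t_dst = np.arange(n_out) / dst_hz
    return np.interp(t_dst, t_src, signal)


def frozen_features(wave, kernels, stride):
    """Hypothetical stand-in for a frozen speech encoder.

    Applies fixed (never updated) strided 1-D filters, a ReLU, and mean-pooling
    over time. A real experiment would call the pretrained model here instead.
    """
    width = kernels.shape[1]
    starts = np.arange(0, len(wave) - width + 1, stride)
    frames = np.stack([wave[s:s + width] for s in starts])  # (n_frames, width)
    activations = np.maximum(frames @ kernels.T, 0.0)       # (n_frames, n_kernels)
    return activations.mean(axis=0)                         # (n_kernels,)


def fit_linear_probe(features, labels, n_classes, ridge=1e-3):
    """Closed-form ridge regression onto one-hot labels; only these weights are trained."""
    x = np.hstack([features, np.ones((len(features), 1))])  # append a bias column
    y = np.eye(n_classes)[labels]
    return np.linalg.solve(x.T @ x + ridge * np.eye(x.shape[1]), x.T @ y)


def predict(weights, features):
    """Return the highest-scoring class for each feature vector."""
    x = np.hstack([features, np.ones((len(features), 1))])
    return (x @ weights).argmax(axis=1)


if __name__ == "__main__":
    # Synthetic two-class data: 2-second windows at 50 Hz with different dominant frequencies.
    rng = np.random.default_rng(0)
    kernels = rng.standard_normal((8, 400))  # frozen filters, fixed once
    t = np.arange(100) / 50.0
    labels = np.array([0, 1] * 10)
    windows = [np.sin(2 * np.pi * (1.0 + 4.0 * c) * t) + 0.1 * rng.standard_normal(100)
               for c in labels]
    feats = np.stack([frozen_features(upsample(w, 50), kernels, stride=320) for w in windows])
    weights = fit_linear_probe(feats, labels, n_classes=2)
    print("training accuracy:", (predict(weights, feats) == labels).mean())
```

Keeping the encoder fixed and fitting only the probe weights is what makes the approach cheap in data-scarce settings: the number of trainable parameters is the feature dimension plus one, times the number of classes.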
