Speech Foundation Models Generalize to Time Series Tasks from Wearable Sensor Data

Jaya Narain, Zakaria Aldeneh, Shirley Ren

TL;DR

Speech foundation models pretrained on large speech corpora generalize to wearable sensor time-series tasks when used as frozen feature extractors with lightweight probes. The approach upsamples sensor signals, passes them through HuBERT and wav2vec 2.0 to obtain embeddings, and trains simple probes that map those embeddings to activity, arrhythmia, and mood labels. The study finds that the early convolutional encoder layers carry the most transferable information, and that the resulting cross-domain representations often outperform in-domain self-supervised baselines, particularly in data-scarce settings. The work is a step toward unified time-series modeling across speech and sensor modalities, with implications for efficient multimodal systems.
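
The upsampling step can be sketched as follows. This is a minimal illustration, not the authors' preprocessing: the summary does not state the interpolation method or the sensor sampling rates, so the linear interpolation, the 100 Hz accelerometer rate, the zero-mean/unit-variance standardization, and the helper name upsample_for_speech_model are all assumptions. The 16 kHz target is the input rate that HuBERT and wav2vec 2.0 are trained on.

```python
import numpy as np

SPEECH_RATE_HZ = 16_000  # input rate expected by HuBERT and wav2vec 2.0


def upsample_for_speech_model(signal, sensor_rate_hz, target_rate_hz=SPEECH_RATE_HZ):
    """Linearly interpolate a 1-D sensor segment to the target rate, then standardize.

    Hypothetical helper: the interpolation method and the standardization
    are assumptions, not details taken from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    duration_s = len(signal) / sensor_rate_hz
    n_out = int(round(duration_s * target_rate_hz))
    t_in = np.arange(len(signal)) / sensor_rate_hz   # original sample times
    t_out = np.arange(n_out) / target_rate_hz        # upsampled sample times
    upsampled = np.interp(t_out, t_in, signal)
    # Standardize to zero mean and unit variance (assumed preprocessing).
    return (upsampled - upsampled.mean()) / (upsampled.std() + 1e-8)


# Hypothetical 2-second accelerometer segment sampled at 100 Hz.
segment = np.sin(2 * np.pi * 1.5 * np.arange(200) / 100.0)
waveform = upsample_for_speech_model(segment, sensor_rate_hz=100)
```

The resulting waveform has the length a speech model would expect for two seconds of audio and could then be passed to a frozen encoder.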

Abstract

Speech and sensor time-series data both encode information in the time and frequency domains, such as spectral powers and waveform shapelets. We show that speech foundation models learn representations that generalize beyond the speech domain and achieve state-of-the-art performance on diverse time-series tasks from wearable sensors. Probes trained on features extracted from HuBERT and wav2vec 2.0 outperform those extracted from self-supervised models trained directly on modality-specific datasets for mood classification, arrhythmia detection, and activity classification tasks. We find that the convolutional feature encoders of speech models are particularly relevant for wearable sensor applications. The proposed approach enhances performance on data-scarce time-series tasks using simple probing methods. This work takes a step toward developing generalized time-series models that unify speech and sensor modalities.
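
To make "simple probing" concrete, the sketch below fits a softmax linear probe on fixed features with plain gradient descent. It is an illustration under stated assumptions rather than the paper's setup: the features are synthetic stand-ins for pooled embeddings from a frozen speech model, and the feature dimension, class count, learning rate, epoch count, and function names (train_linear_probe, predict) are all hypothetical.

```python
import numpy as np


def train_linear_probe(features, labels, n_classes, lr=0.1, epochs=300):
    """Fit a softmax linear probe on fixed features; the encoder is never updated."""
    n, d = features.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                  # gradient of mean cross-entropy
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b


def predict(features, W, b):
    """Return the highest-scoring class index for each row of features."""
    return np.argmax(features @ W + b, axis=1)


# Synthetic stand-in for pooled embeddings (assumption: 16-dim, 3 classes).
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
class_means = rng.normal(size=(3, 16)) * 3.0
features = class_means[labels] + rng.normal(size=(300, 16))
# Standardize each feature dimension before probing.
features = (features - features.mean(axis=0)) / features.std(axis=0)

W, b = train_linear_probe(features, labels, n_classes=3)
accuracy = float((predict(features, W, b) == labels).mean())
```

Only the probe parameters W and b are learned here, which is what keeps this style of evaluation cheap when labeled data are scarce.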

Paper Structure

This paper contains 5 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Time-series data from modalities such as speech, accelerometer (Accel.), electrocardiogram (ECG), and photoplethysmogram (PPG) signals contain rich temporal and spectral characteristics, including frequency band powers, periodic patterns, and distinctive waveform shapelets.
  • Figure 2: Speech foundation models as feature extractors for other modalities. Time-series data are fed as input to audio embedding models, with short segments upsampled. Task-specific probes are trained on the extracted features and used to generate predictions across time-series tasks.
  • Figure 3: Performance by transformer layer with MLP probes for each task: activity classification (results shown with the PAMAP2 leg data), arrhythmia detection, and mood classification. Early layer performance is better across modalities, particularly for wav2vec 2.0.
  • Figure 4: Visualization of a selection of convolutional filters from the first convolutional layer of HuBERT. The filters capture periodic and spiked shapelets, and some resemble bandpass and high-pass filters.
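
One way to check whether a convolutional filter behaves like a bandpass or high-pass filter, as Figure 4 describes, is to inspect its magnitude response. The sketch below does this for two synthetic kernels. These are hypothetical stand-ins, not weights from HuBERT; the kernel length of 10, the 16 kHz sampling rate, and the helper name peak_frequency_hz are assumptions made for illustration.

```python
import numpy as np


def peak_frequency_hz(kernel, sample_rate_hz=16_000, n_fft=512):
    """Return the frequency (Hz) where a 1-D FIR kernel's magnitude response peaks."""
    response = np.abs(np.fft.rfft(kernel, n=n_fft))  # zero-padded to n_fft points
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate_hz)
    return float(freqs[np.argmax(response)])


# Hypothetical length-10 kernels (not taken from any trained model).
n = np.arange(10)
# Windowed cosine centered near 4 kHz: passes a band around that frequency.
bandpass_like = np.hanning(10) * np.cos(2 * np.pi * 4000 * n / 16_000)
# First difference: suppresses slow trends and passes rapid changes.
highpass_like = np.array([1.0, -1.0] + [0.0] * 8)

bandpass_peak = peak_frequency_hz(bandpass_like)
highpass_peak = peak_frequency_hz(highpass_like)
```

A kernel whose response peaks at an interior frequency acts like a bandpass filter, while one whose response peaks at the Nyquist frequency (half the sampling rate) acts like a high-pass filter.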