HiMAE: Hierarchical Masked Autoencoders Discover Resolution-Specific Structure in Wearable Time Series

Simon A. Lee, Cyrus Tanade, Hao Zhou, Juhyeon Lee, Megha Thukral, Minji Han, Rachel Choi, Md Sazzad Hissain Khan, Baiying Lu, Migyeong Gwak, Mehrab Bin Morshed, Viswam Nathan, Md Mahbubur Rahman, Li Zhu, Subramaniam Venkatraman, Sharanya Arcot Desai

TL;DR

HiMAE addresses the question of how temporal resolution governs predictive utility in wearable time series by introducing a hierarchical masked autoencoder that produces multi-resolution embeddings. By coupling patch masking with a U-Net–style encoder–decoder, HiMAE explicitly models multiple temporal scales and exposes resolution-specific signals via per-layer linear probes, turning resolution into a diagnostic tool. Trained on roughly 80,000 hours of PPG from tens of thousands of participants, HiMAE achieves state-of-the-art performance across generative, classification, and regression benchmarks while remaining orders of magnitude smaller than transformer-based foundation models, enabling true on-device inference on smartwatch hardware. The results support the resolution hypothesis, show robust cross-task alignment of informative scales, and offer a practical pathway for privacy-preserving, edge-ready physiological intelligence with interpretable scale-specific structure.

Abstract

Wearable sensors provide abundant physiological time series, yet the principles governing their predictive utility remain unclear. We hypothesize that temporal resolution is a fundamental axis of representation learning, with different clinical and behavioral outcomes relying on structure at distinct scales. To test this resolution hypothesis, we introduce HiMAE (Hierarchical Masked Autoencoder), a self-supervised framework that combines masked autoencoding with a hierarchical convolutional encoder-decoder. HiMAE produces multi-resolution embeddings that enable systematic evaluation of which temporal scales carry predictive signal, transforming resolution from a hyperparameter into a probe for interpretability. Across classification, regression, and generative benchmarks, HiMAE consistently outperforms state-of-the-art foundation models that collapse scale, while being orders of magnitude smaller. HiMAE is an efficient representation learner compact enough to run entirely on-watch, achieving sub-millisecond inference on smartwatch-class CPUs for true edge inference. Together, these contributions position HiMAE as both an efficient self-supervised learning method and a discovery tool for scale-sensitive structure in wearable health.
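
To make the idea of probing each resolution concrete, here is a minimal toy sketch, not the authors' implementation. It assumes average pooling as a stand-in for the learned hierarchical encoder, uses synthetic Gaussian data, and fits a closed-form ridge probe per level. The names resolution_pyramid, fit_probe, and r2_score are hypothetical. The point is only that a target built from fine-grained structure is linearly recoverable at the finest level and lost after pooling, while a coarse target survives at every level.

```python
import numpy as np

def resolution_pyramid(x, n_levels):
    """Toy stand-in for a hierarchical encoder (assumption: average pooling).

    Level 0 is the raw signal; each further level halves the temporal length.
    x has shape (n_samples, length), with length divisible by 2**n_levels.
    """
    levels = [x]
    for _ in range(n_levels):
        prev = levels[-1]
        levels.append(prev.reshape(prev.shape[0], -1, 2).mean(axis=2))
    return levels

def fit_probe(z, y, lam=1e-6):
    """Closed-form ridge regression on frozen features z, with a bias column."""
    zb = np.hstack([z, np.ones((len(z), 1))])
    return np.linalg.solve(zb.T @ zb + lam * np.eye(zb.shape[1]), zb.T @ y)

def r2_score(z, y, w):
    """Coefficient of determination of the linear probe w on (z, y)."""
    zb = np.hstack([z, np.ones((len(z), 1))])
    resid = y - zb @ w
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
x = rng.standard_normal((600, 64))
targets = {
    "fine": x[:, 0] - x[:, 1],   # depends on sample-level detail
    "coarse": x.mean(axis=1),    # depends only on the window average
}
train, test = slice(0, 400), slice(400, None)

levels = resolution_pyramid(x, n_levels=3)
scores = {
    name: [r2_score(z[test], y[test], fit_probe(z[train], y[train])) for z in levels]
    for name, y in targets.items()
}
for name, per_level in scores.items():
    print(name, [round(s, 2) for s in per_level])
```

In this toy setting the per-level scores act as the diagnostic: the level at which a probe's held-out score collapses indicates the temporal scale the target depends on.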

Paper Structure

This paper contains 49 sections, 17 figures, 21 tables.

Figures (17)

  • Figure 1: HiMAE pre-training and evaluation pipeline. (1) Physiological sequences are split into temporal patches. (2) Selected patches are masked randomly or contiguously. (3) A U-Net–style CNN encoder–decoder reconstructs missing values, with loss applied only to masked regions. (4) Multi-resolution embeddings feed linear probes for classification and regression benchmarking. (5) Three categorized task-lists are evaluated.
  • Figure 2: HiMAE is lightweight compared to other methods proposed in the literature.
  • Figure 3: HiMAE exhibits superior scaling across axes. Mean squared error decreases most rapidly for HiMAE as data, participants, model size, and compute scale. Ablations without skip connections confirm that both the hierarchical design and skip pathways are helpful for generative performance. Grey lines show individual runs; colored lines show the average across runs.
  • Figure 4: Performance on generative benchmarks. Mean squared error and $R^2$ for random imputation, temporal interpolation, and temporal extrapolation at varying missingness levels. Bold outline indicates best performing model.
  • Figure 5: AUROC across downstream tasks. Highlighted shapes indicate best performing model. HiMAE matches or outperforms foundation model baselines with far fewer parameters.
  • ...and 12 more figures
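
Steps (1) to (3) of the Figure 1 caption (patching, random or contiguous masking, and a loss restricted to masked regions) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's code: the patch length, number of patches, mask ratio, and zero-fill of masked samples are all arbitrary choices, the helper names make_patch_mask and masked_mse are hypothetical, and the "reconstruction" is a placeholder instead of a trained encoder-decoder.

```python
import numpy as np

def make_patch_mask(n_patches, mask_ratio, mode, rng):
    """Boolean array of length n_patches; True marks a masked patch."""
    n_masked = max(1, int(round(mask_ratio * n_patches)))
    mask = np.zeros(n_patches, dtype=bool)
    if mode == "random":
        # Masked patches are scattered uniformly at random.
        mask[rng.choice(n_patches, size=n_masked, replace=False)] = True
    elif mode == "contiguous":
        # One block of consecutive patches is masked.
        start = rng.integers(0, n_patches - n_masked + 1)
        mask[start:start + n_masked] = True
    else:
        raise ValueError(f"unknown mode: {mode}")
    return mask

def masked_mse(x, x_hat, patch_mask, patch_len):
    """Mean squared error over samples inside masked patches only."""
    sample_mask = np.repeat(patch_mask, patch_len)
    diff = (x_hat - x)[sample_mask]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
patch_len, n_patches = 10, 20                      # assumed sizes
x = np.sin(np.linspace(0, 8 * np.pi, patch_len * n_patches))

patch_mask = make_patch_mask(n_patches, 0.5, "contiguous", rng)
sample_mask = np.repeat(patch_mask, patch_len)

x_in = np.where(sample_mask, 0.0, x)               # masked samples zeroed (assumption)
x_hat = x_in.copy()                                # placeholder for a model's output
loss = masked_mse(x, x_hat, patch_mask, patch_len)

# Errors on visible samples do not change the loss.
x_hat_noisy = x_hat.copy()
x_hat_noisy[~sample_mask] += 5.0
loss_noisy = masked_mse(x, x_hat_noisy, patch_mask, patch_len)
print(loss, loss_noisy)
```

Restricting the loss to masked samples means the model gets no credit for copying visible input through, so the training signal comes entirely from inferring the hidden segments.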