HiMAE: Hierarchical Masked Autoencoders Discover Resolution-Specific Structure in Wearable Time Series
Simon A. Lee, Cyrus Tanade, Hao Zhou, Juhyeon Lee, Megha Thukral, Minji Han, Rachel Choi, Md Sazzad Hissain Khan, Baiying Lu, Migyeong Gwak, Mehrab Bin Morshed, Viswam Nathan, Md Mahbubur Rahman, Li Zhu, Subramaniam Venkatraman, Sharanya Arcot Desai
TL;DR
HiMAE addresses the question of how temporal resolution governs predictive utility in wearable time series by introducing a hierarchical masked autoencoder that produces multi-resolution embeddings. By coupling patch masking with a U-Net–style encoder–decoder, HiMAE explicitly models multiple temporal scales and exposes resolution-specific signals via per-layer linear probes, turning resolution into a diagnostic tool. Trained on roughly 80,000 hours of PPG from tens of thousands of participants, HiMAE achieves state-of-the-art performance across generative, classification, and regression benchmarks while remaining orders of magnitude smaller than transformer-based foundation models, enabling true on-device inference on smartwatch hardware. The results support the resolution hypothesis, show robust cross-task alignment of informative scales, and offer a practical pathway for privacy-preserving, edge-ready physiological intelligence with interpretable scale-specific structure.
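To make the patch-masking idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a 1-D signal split into non-overlapping fixed-length patches, a random subset of which is zeroed out, with the reconstruction error scored only on the masked samples. The function names (`patch_mask`, `apply_mask`, `masked_mse`), the patch length, and the masking ratio are all hypothetical choices for illustration.

```python
import random

def patch_mask(num_patches, mask_ratio, seed=0):
    """Return a list of booleans, True where a patch is masked (hypothetical helper)."""
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    masked_idx = set(rng.sample(range(num_patches), num_masked))
    return [i in masked_idx for i in range(num_patches)]

def apply_mask(signal, mask, patch_len):
    """Zero out every sample that falls inside a masked patch."""
    out = list(signal)
    for p, is_masked in enumerate(mask):
        if is_masked:
            for t in range(p * patch_len, (p + 1) * patch_len):
                out[t] = 0.0
    return out

def masked_mse(target, recon, mask, patch_len):
    """Mean squared error computed over masked samples only."""
    err, n = 0.0, 0
    for p, is_masked in enumerate(mask):
        if is_masked:
            for t in range(p * patch_len, (p + 1) * patch_len):
                err += (target[t] - recon[t]) ** 2
                n += 1
    return err / n if n else 0.0

# Toy usage: a constant signal of 40 samples, 8 patches of length 5, half masked.
signal = [1.0] * 40
mask = patch_mask(num_patches=8, mask_ratio=0.5)
corrupted = apply_mask(signal, mask, patch_len=5)
# Scoring the corrupted input itself as a trivial "reconstruction" gives the
# error an encoder-decoder would have to reduce during pretraining.
baseline_loss = masked_mse(signal, corrupted, mask, patch_len=5)
```

In an actual masked autoencoder, `corrupted` would be passed through the encoder-decoder and the loss would be evaluated on its output; the trivial baseline here only illustrates which samples contribute to the objective.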
Abstract
Wearable sensors provide abundant physiological time series, yet the principles governing their predictive utility remain unclear. We hypothesize that temporal resolution is a fundamental axis of representation learning, with different clinical and behavioral outcomes relying on structure at distinct scales. To test this resolution hypothesis, we introduce HiMAE (Hierarchical Masked Autoencoder), a self-supervised framework that combines masked autoencoding with a hierarchical convolutional encoder-decoder. HiMAE produces multi-resolution embeddings that enable systematic evaluation of which temporal scales carry predictive signal, transforming resolution from a hyperparameter into a probe for interpretability. Across classification, regression, and generative benchmarks, HiMAE consistently outperforms state-of-the-art foundation models that collapse scale, while being orders of magnitude smaller. HiMAE is an efficient representation learner compact enough to run entirely on-watch, achieving sub-millisecond inference on smartwatch-class CPUs for true edge inference. Together, these contributions position HiMAE as both an efficient self-supervised learning method and a discovery tool for scale-sensitive structure in wearable health.
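The notion of multi-resolution embeddings can be illustrated with a small sketch. It is an assumption-laden stand-in rather than the HiMAE encoder: learned strided convolutions are replaced here by plain average pooling with stride 2, so that the only thing shown is how each level of a hierarchy summarizes a progressively longer span of the input. The names `avg_pool2`, `feature_pyramid`, and `samples_per_step` are hypothetical.

```python
def avg_pool2(x):
    """Halve temporal resolution by averaging adjacent pairs (stand-in for a strided conv)."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]

def feature_pyramid(signal, num_levels):
    """Return one sequence per level, each half as long as the previous one."""
    levels = []
    current = list(signal)
    for _ in range(num_levels):
        current = avg_pool2(current)
        levels.append(current)
    return levels

def samples_per_step(level_index):
    """Number of input samples summarized by one step at a given level (0-based)."""
    return 2 ** (level_index + 1)

# Toy usage: a 16-sample ramp pooled into three levels.
signal = [float(t) for t in range(16)]
pyramid = feature_pyramid(signal, num_levels=3)
spans = [samples_per_step(i) for i in range(len(pyramid))]
```

Under this sketch, a per-layer linear probe would amount to fitting a separate linear model on each entry of `pyramid` and comparing downstream performance across levels, which is how resolution can serve as a diagnostic rather than a fixed hyperparameter.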
