Rotary Masked Autoencoders are Versatile Learners
Uros Zivanovic, Serafina Di Gioia, Andre Scaffidi, Martín de los Rios, Gabriella Contardo, Roberto Trotta
TL;DR
RoMAE extends the Masked Autoencoder (MAE) with Rotary Positional Embeddings (RoPE) so that it can handle continuous, multi-dimensional positions, which enables interpolation on irregular and multivariate time-series without time-series-specific architectural changes. The method combines Axial RoPE and p-RoPE within an MAE-style pretraining scheme and performs strongly on images, audio, and irregular time-series while retaining MAE's performance on standard modalities. A key finding is that a learned [CLS] token lets the model reconstruct absolute positions, whereas omitting it exposes relative-position behavior and translational invariance, which clarifies how RoPE interacts with learned embeddings. The approach is reported to be data-efficient, modality-agnostic, and scalable to moderate sequence lengths, making it a practical off-the-shelf option for representation learning across domains.
Abstract
Applying Transformers to irregular time-series typically requires specializations to their baseline architecture, which can result in additional computational overhead and increased method complexity. We present the Rotary Masked Autoencoder (RoMAE), which uses the popular Rotary Positional Embedding (RoPE) method for continuous positions. RoMAE is an extension of the Masked Autoencoder (MAE) that enables interpolation and representation learning with multidimensional continuous positional information while avoiding any time-series-specific architectural specializations. We showcase RoMAE's performance on a variety of modalities, including irregular and multivariate time-series, images, and audio. RoMAE surpasses specialized time-series architectures on difficult datasets such as the DESC ELAsTiCC Challenge while maintaining MAE's usual performance across other modalities. In addition, we investigate RoMAE's ability to reconstruct the embedded continuous positions and show that including learned embeddings in the input sequence breaks RoPE's relative position property.
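As a rough illustration of the relative position property mentioned above, the following minimal sketch applies standard one-dimensional RoPE to real-valued (continuous) positions and checks that the query-key score is unchanged when both positions are shifted by the same offset. It is not the authors' implementation: the function names (rope_rotate, attention_score), the frequency base of 10000, the feature dimension, and the example timestamps are all assumptions chosen for illustration, and the sketch does not model Axial RoPE, p-RoPE, or the effect of a learned [CLS] token.

```python
import math
import random


def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by angles proportional to pos.

    x:    list of floats with even length d
    pos:  continuous position, e.g. an irregular timestamp (assumed scalar)
    base: frequency base; 10000 is the common RoPE default (an assumption here)
    """
    d = len(x)
    out = [0.0] * d
    for i in range(0, d, 2):
        # One rotation frequency per feature pair.
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out


def attention_score(q, k, pos_q, pos_k):
    """Unnormalized attention score between a rotated query and a rotated key."""
    rq, rk = rope_rotate(q, pos_q), rope_rotate(k, pos_k)
    return sum(a * b for a, b in zip(rq, rk))


random.seed(0)
q = [random.gauss(0, 1) for _ in range(8)]
k = [random.gauss(0, 1) for _ in range(8)]

# Two irregular timestamps, then the same pair shifted by a common offset.
s1 = attention_score(q, k, 0.37, 2.91)
s2 = attention_score(q, k, 0.37 + 5.0, 2.91 + 5.0)

# The scores should agree up to floating-point error, because each pairwise
# rotation contributes only through the difference pos_q - pos_k.
print(math.isclose(s1, s2, abs_tol=1e-9))
```

Because each feature pair is rotated by an angle that is linear in the position, nothing in this construction requires positions to be integers or evenly spaced, which is what makes the approach applicable to irregularly sampled sequences.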
