Rotary Masked Autoencoders are Versatile Learners

Uros Zivanovic, Serafina Di Gioia, Andre Scaffidi, Martín de los Rios, Gabriella Contardo, Roberto Trotta

TL;DR

RoMAE extends Masked Autoencoders with Rotary Positional Embeddings to support continuous, multi-dimensional positions, enabling effective interpolation for irregular and multivariate time-series without specialized architectural changes. The method leverages Axial RoPE and p-RoPE within an MAE-like pretraining paradigm, yielding strong performance across images, audio, and irregular time-series datasets while preserving MAE's strengths on standard modalities. A key finding is that learned [CLS] tokens enable absolute-position reconstruction, whereas omitting them reveals relative-position dynamics and translational invariance, informing how RoPE interacts with learned embeddings. The approach is demonstrated to be data-efficient, modality-agnostic, and scalable for moderate sequence lengths, offering a practical, off-the-shelf pathway to robust representation learning across diverse domains.

Abstract

Applying Transformers to irregular time-series typically requires specializations to their baseline architecture, which can result in additional computational overhead and increased method complexity. We present the Rotary Masked Autoencoder (RoMAE), which utilizes the popular Rotary Positional Embedding (RoPE) method for continuous positions. RoMAE is an extension to the Masked Autoencoder (MAE) that enables interpolation and representation learning with multidimensional continuous positional information while avoiding any time-series-specific architectural specializations. We showcase RoMAE's performance on a variety of modalities including irregular and multivariate time-series, images, and audio, demonstrating that RoMAE surpasses specialized time-series architectures on difficult datasets such as the DESC ELAsTiCC Challenge while maintaining MAE's usual performance across other modalities. In addition, we investigate RoMAE's ability to reconstruct the embedded continuous positions, demonstrating that including learned embeddings in the input sequence breaks RoPE's relative position property.
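The relative-position property mentioned above can be illustrated with a minimal NumPy sketch of 1D RoPE evaluated at continuous (non-integer) positions. This is not the authors' implementation: the function name `rope_rotate`, the base of 10000, the feature size, and the sample positions are all assumptions chosen for illustration. The sketch shows that the attention score between two rotated vectors is unchanged when both positions are shifted by the same amount, and that a token pinned at a fixed position (here a stand-in for a learned [CLS] embedding, assumed to sit at position 0) does not share this invariance.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by angles pos * theta_i.

    pos may be any real number, so irregular timestamps work unchanged.
    """
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even"
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def score(q, k, pos_q, pos_k):
    """Unnormalised attention score between a rotated query and key."""
    return float(rope_rotate(q, pos_q) @ rope_rotate(k, pos_k))

rng = np.random.default_rng(0)
q, k, cls = (rng.normal(size=8) for _ in range(3))

shift = 11.2  # arbitrary translation applied to every data position

# Two data tokens: the score depends only on pos_q - pos_k.
s_orig = score(q, k, 0.37, 2.915)
s_shift = score(q, k, 0.37 + shift, 2.915 + shift)

# Hypothetical anchor token pinned at position 0 (assumption): it does not
# move with the data, so its score against a shifted key changes.
s_cls_orig = score(cls, k, 0.0, 2.915)
s_cls_shift = score(cls, k, 0.0, 2.915 + shift)

print("data-data scores equal:", np.isclose(s_orig, s_shift))
print("anchor-data scores equal:", np.isclose(s_cls_orig, s_cls_shift))
```

Because the rotation is applied per position rather than looked up from a table, nothing in the sketch requires positions to lie on a regular grid.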

Paper Structure

This paper contains 40 sections, 4 theorems, 8 equations, 8 figures, 21 tables.

Key Result

Proposition 4.1

For any irregular dimension $d_{i}$ in $\mathbf{x}$, the corresponding patch size for that dimension $p_{i}$ must be equal to 1.
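As a rough illustration of how this constraint could be enforced in practice, the sketch below checks a patch configuration before patchification. The helper `validate_patch_sizes` and its arguments are hypothetical and not taken from the paper or its code; the only rule encoded is the one stated in the proposition, namely that an irregular dimension must use a patch size of 1.

```python
def validate_patch_sizes(patch_sizes, irregular):
    """Check the Proposition 4.1 constraint on a patch configuration.

    patch_sizes: one patch size p_i per data dimension.
    irregular:   one bool per dimension, True if that dimension is irregular.
    Returns True when valid, raises ValueError otherwise.
    """
    if len(patch_sizes) != len(irregular):
        raise ValueError("patch_sizes and irregular must have equal length")
    for i, (p, is_irregular) in enumerate(zip(patch_sizes, irregular)):
        if is_irregular and p != 1:
            raise ValueError(
                f"dimension {i} is irregular, so its patch size must be 1 (got {p})"
            )
    return True

# Hypothetical example: an irregularly sampled time axis followed by two
# regular spatial axes. The 16x16 spatial patches are an arbitrary choice.
ok = validate_patch_sizes((1, 16, 16), (True, False, False))

# A patch size of 4 along the irregular time axis is rejected.
try:
    validate_patch_sizes((4, 16, 16), (True, False, False))
    rejected = False
except ValueError:
    rejected = True
```

An intuition consistent with the statement: a patch larger than 1 groups several values under one position, which only makes sense when the spacing inside the patch is known and constant.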

Figures (8)

  • Figure 1: Overview of the RoMAE pipeline. Left: Visualisation of data embedding via multi-dimensional (ND) patchification for illustrative data realisations in 1, 2 and 3D. Centre: Full depiction of RoMAE architecture. The optional [CLS] token is omitted from the input sequence for simplicity. Right: The RoMAE encoder/decoder with RoPE operations denoted by rotational arrows.
  • Figure 2: RoMAE position reconstruction MSE across two positional ranges.
  • Figure 3: Average MSE obtained from the interpolation task using RoMAE-tiny for time-series with a single varying frequency component. Left: MSE computed on the Fast Fourier Transform (FFT). Right: MSE in the time domain. We generate 200 time-series per individual frequency, with 50 observed noisy points and 50 masked (interpolated) points, giving a limiting frequency of 25 according to the Nyquist-Shannon sampling theorem. Error bars show the standard deviation of the MSE obtained for each individual frequency.
  • Figure 4: Same as above but now for time-series with two frequency modes present in the signal. Left: MSE computed on the FFT. Right: MSE in the time domain. The time-series have 50 observed noisy points and 50 masked (interpolated) points.
  • Figure 5: Illustrative realisation from the evaluation of RoMAE on a bi-frequency time-series. Left: Interpolation in the time domain for a composite sinusoidal signal with base frequencies 1 and 5 Hz. Right: FFT of the ground truth and predicted waveform.
  • ...and 3 more figures

Theorems & Definitions (6)

  • Definition 3.1: Regular and irregular dimensions
  • Proposition 4.1
  • Proposition 4.2: Reconstructing absolute position
  • Corollary 4.1: Translational invariance in the RoMAE Encoder
  • Corollary 4.2: Effect of distance on absolute position reconstruction
  • Definition C.1: [CLS] token