MLEM: Generative and Contrastive Learning as Distinct Modalities for Event Sequences

Viktor Moskvoretskii, Dmitry Osin, Egor Shvetsov, Igor Udovichenko, Maxim Zhelnin, Andrey Dukhovny, Anna Zhimerikina, Evgeny Burnaev

TL;DR

Event Sequences (EvS) pose unique challenges for self-supervised learning (SSL) due to irregular timing and mixed feature types. The authors compare generative, contrastive, and naive hybrid SSL methods and introduce MLEM, a multimodal approach that aligns generative and contrastive embeddings via a SigLIP-inspired alignment loss while also optimizing a generative reconstruction objective. Results show that neither pure generative nor pure contrastive pre-training consistently dominates; MLEM generally delivers superior performance across downstream tasks and embedding quality, albeit with sensitivity to data sparsity and higher computational costs. This work demonstrates the value of treating diverse SSL signals as complementary modalities for EvS, offering a practical route to more robust self-supervised representations in irregular time-series domains.

Abstract

This study explores the application of self-supervised learning techniques to event sequences. Event sequences are a key modality in applications such as banking, e-commerce, and healthcare. However, there is limited research on self-supervised learning for event sequences, and methods from other domains like images, texts, and speech may not easily transfer. To determine the most suitable approach, we conduct a detailed comparative analysis of previously identified best-performing methods. Our assessment includes classifying event sequences, predicting the next event, and evaluating embedding quality. We find that neither the contrastive nor the generative method is superior. These results further highlight the potential benefits of combining both methods. Given the lack of research on hybrid models in this domain, we initially adapt the baseline model from another domain. However, upon observing its underperformance, we develop a novel method called the Multimodal-Learning Event Model (MLEM). MLEM treats contrastive learning and generative modeling as distinct yet complementary modalities, aligning their embeddings. The results of our study demonstrate that combining contrastive and generative approaches into one procedure with MLEM achieves superior performance across multiple metrics.
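
To make the idea of aligning generative and contrastive embeddings more concrete, the following is a minimal sketch of a sigmoid pairwise alignment loss in the style of SigLIP, which the TL;DR names as the inspiration. It is not the authors' implementation: the function names, the `temperature` and `bias` defaults, and the use of plain Python lists instead of tensors are all assumptions made for illustration. Matching pairs (same sequence) are labeled +1 and all other pairs -1.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_alignment_loss(h_gen, h_con, temperature=10.0, bias=-10.0):
    """Sigmoid pairwise loss between two batches of embeddings.

    h_gen[i] and h_con[i] are assumed to come from the same sequence.
    temperature and bias are illustrative values, not taken from the paper.
    """
    n = len(h_gen)
    g = [l2_normalize(v) for v in h_gen]
    c = [l2_normalize(v) for v in h_con]
    total = 0.0
    for i in range(n):
        for j in range(n):
            similarity = sum(a * b for a, b in zip(g[i], c[j]))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0  # +1 for the matching pair
            total -= log_sigmoid(label * logit)
    return total / n


# Hypothetical toy batch: two sequences with 2-dimensional embeddings.
aligned = sigmoid_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                 [[1.0, 0.0], [0.0, 1.0]])
swapped = sigmoid_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                 [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is lower when each generative embedding points in the same direction as its own contrastive embedding than when the pairs are swapped, which is the behavior an alignment objective of this kind is meant to reward.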

Paper Structure (26 sections, 9 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: An example of one sequence $S_i$ with some categorical and real-valued sub-events $\{k^1,k^2, \ldots\}$.
  • Figure 2: Schematic overview of the models. A) The Contrastive Model divides sequence $S_i$ into subsequences $\{S_i', S_i''\}$, encodes them into latent vectors $\{h_i', h_i''\}$, and applies the contrastive loss $L_{\text{con}}$. B) The Generative Model autoregressively reconstructs sequence $S$ while cross-attending to latent vector $h$, derived from $S$ via a bottleneck encoder. The language modeling loss $L_{\text{LM}}$ is applied to the reconstructed sequence $\hat{S}$. C) The MLEM Hybrid Model integrates the generative and contrastive approaches, treating them as distinct modalities. It generates latent vectors $h_i^g$ and $h_i^c$ from a learned generative encoder and a frozen, pretrained contrastive encoder, respectively. An alignment loss $L_{\text{align}}$ aligns vectors from the same sequence while separating those from different sequences. The language modeling loss $L_{\text{LM}}$ is applied as in the generative model.
  • Figure 3: t-SNE visualizations of embeddings produced by various pre-training strategies. The left portion displays the embeddings for the Age dataset, while the bottom row illustrates those for the TaoBao dataset. Each point represents a sequence $S_i$ from a given dataset, colored according to the corresponding attribute $y_i$. There are 4 classes for Age and 2 classes for TaoBao.
  • Figure 4: The figure illustrates the pendulum motion at various instances, with time steps determined by a Hawkes process. It captures the pendulum's trajectory using only the normalized planar coordinates at these sampled times.
  • Figure 5: Example of temporal distribution of events ranging from 0 to 5 seconds. Each event is marked with a star along the timeline. The y-axis serves only as a technical aid to separate the events for clarity and does not convey any additional information.
  • ...and 2 more figures
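
The setup described in the Figure 4 caption, a pendulum observed at times drawn from a Hawkes process, can be sketched as follows. This is only an illustration under stated assumptions: the exponential excitation kernel, the parameters `mu`, `alpha`, and `beta`, the small-angle pendulum solution, and all function names are hypothetical choices and are not taken from the paper. Event times are sampled with Ogata's thinning method.

```python
import math
import random


def sample_hawkes_times(mu, alpha, beta, horizon, seed=0):
    """Sample event times on (0, horizon] from a univariate Hawkes process.

    Assumed intensity: mu + sum(alpha * exp(-beta * (t - t_i))) over past
    events t_i. Uses Ogata's thinning; alpha / beta < 1 keeps it stable.
    """
    rng = random.Random(seed)
    events = []
    t = 0.0

    def intensity(at):
        return mu + sum(alpha * math.exp(-beta * (at - s)) for s in events)

    while True:
        # Intensity only decays until the next event, so the current
        # value is a valid upper bound for the thinning step.
        upper = intensity(t)
        t += rng.expovariate(upper)
        if t > horizon:
            break
        if rng.random() * upper <= intensity(t):
            events.append(t)  # accept the candidate time as an event
    return events


def pendulum_xy(t, theta0=0.5, length=1.0, gravity=9.81):
    """Normalized planar coordinates of a pendulum bob at time t.

    Uses the small-angle solution theta(t) = theta0 * cos(omega * t),
    an assumption made here for simplicity.
    """
    omega = math.sqrt(gravity / length)
    theta = theta0 * math.cos(omega * t)
    return math.sin(theta), -math.cos(theta)


times = sample_hawkes_times(mu=1.0, alpha=0.5, beta=1.0, horizon=50.0, seed=0)
trajectory = [pendulum_xy(t) for t in times]
```

Each entry of `trajectory` is an irregularly sampled observation, so the resulting sequence combines real-valued features with uneven time gaps, which is the kind of structure event-sequence models have to handle.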