Self-Supervised Contrastive Pre-Training for Multivariate Point Processes

Xiao Shou; Dharmashankar Subramanian; Debarun Bhattacharjya; Tian Gao; Kristin P. Bennett

TL;DR

This work introduces Event-former, a transformer-based self-supervised paradigm for multivariate temporal point processes. It advances representation learning in three ways: inserting void epochs to capture the absence of events, employing a masked event pre-training objective with combined positional and temporal encodings (PE+TE), and adding a contrastive component between real and void instances. Pre-training enables fine-tuning on small downstream datasets, delivering improvements of up to 20% in next-event time and type prediction over state-of-the-art baselines on synthetic and real data. Ablation studies confirm the importance of void events, the MEM masking strategy, and the combined PE+TE encoding, and show strong transfer performance on finance, e-commerce, and political-domain datasets.
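
As a rough illustration of what a combined positional and temporal encoding could look like, the sketch below encodes each event's index and its timestamp with the standard sinusoidal scheme and sums the two vectors. The summation, the base constant 10000, and the function names `sinusoidal` and `combined_encoding` are assumptions made for illustration; the summary above does not specify how Event-former combines the two encodings.

```python
import math

def sinusoidal(value, d_model):
    """Sinusoidal encoding of a scalar (an index or a timestamp) into d_model dims."""
    enc = []
    for j in range(d_model):
        # Dimensions 2k and 2k+1 share the same frequency.
        angle = value / (10000 ** (2 * (j // 2) / d_model))
        enc.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return enc

def combined_encoding(times, d_model):
    """Assumed combination: PE(index) + TE(timestamp), added elementwise."""
    out = []
    for i, t in enumerate(times):
        pe = sinusoidal(i, d_model)   # positional encoding of the event's order
        te = sinusoidal(t, d_model)   # temporal encoding of the event's timestamp
        out.append([p + q for p, q in zip(pe, te)])
    return out

if __name__ == "__main__":
    for row in combined_encoding([0.0, 0.7, 3.2], d_model=8):
        print([round(x, 3) for x in row])
```

Under this assumption, two events with the same order index but different timestamps receive different vectors, which is the property a continuous-time encoding needs and a purely positional one lacks.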

Abstract

Self-supervision is one of the hallmarks of representation learning in the increasingly popular suite of foundation models, including large language models such as BERT and GPT-3, but to the best of our knowledge it has not been pursued in the context of multivariate event streams. We introduce a new paradigm for self-supervised learning for multivariate point processes using a transformer encoder. Specifically, we design a novel pre-training strategy for the encoder where we not only mask random event epochs but also insert randomly sampled "void" epochs where an event does not occur; this differs from typical discrete-time pretext tasks such as word-masking in BERT, and it extends masking so that it better captures continuous-time dynamics. To improve downstream tasks, we introduce a contrasting module that compares real events to simulated void instances. The pre-trained model can subsequently be fine-tuned on a potentially much smaller event dataset, conceptually similar to the typical transfer of popular pre-trained language models. We demonstrate the effectiveness of our proposed paradigm on the next-event prediction task using synthetic datasets and three real applications, observing a relative performance boost of up to 20% compared to state-of-the-art models.
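
The pre-training input described in the abstract can be sketched roughly as follows: void epochs are sampled and merged into the real event sequence, and then a random subset of epochs is masked. This is only a minimal sketch under stated assumptions. Uniform sampling over the observed window, the `[VOID]` and `[MASK]` tokens, the masking probability, and the function names are all hypothetical and are not taken from the paper.

```python
import random

MASK = "[MASK]"  # hypothetical mask token
VOID = "[VOID]"  # hypothetical label for an epoch where no event occurs

def insert_void_epochs(events, n_void, rng):
    """Merge n_void void epochs into a time-sorted list of (time, type) events.

    Void times are drawn uniformly over the observed window (an assumption).
    Returns a time-sorted list of (time, type, is_real) triples.
    """
    t_start, t_end = events[0][0], events[-1][0]
    real = [(t, k, True) for t, k in events]
    void = [(rng.uniform(t_start, t_end), VOID, False) for _ in range(n_void)]
    return sorted(real + void, key=lambda e: e[0])

def mask_events(sequence, mask_prob, rng):
    """Hide each epoch's type with probability mask_prob.

    Returns (inputs, targets): masked positions carry MASK in the input and
    the hidden type in the target; unmasked positions have target None.
    """
    inputs, targets = [], []
    for t, k, _is_real in sequence:
        if rng.random() < mask_prob:
            inputs.append((t, MASK))
            targets.append(k)
        else:
            inputs.append((t, k))
            targets.append(None)
    return inputs, targets

if __name__ == "__main__":
    rng = random.Random(0)
    stream = [(0.0, "borrow"), (1.0, "repay"), (2.5, "deposit")]
    seq = insert_void_epochs(stream, n_void=4, rng=rng)
    print(mask_events(seq, mask_prob=0.15, rng=rng))
```

The `is_real` flag is kept on each epoch because a contrastive component of the kind the abstract mentions needs to know which instances are real and which are void.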

Paper Structure (35 sections, 1 theorem, 8 equations, 5 figures, 5 tables)

Introduction
Background and Related Work
Temporal Point Processes
Transformers for Event Data
Self-supervision for Sequence Data
A Self-supervised Learning Paradigm
Void Events in Transformers
Void Events as Fake Epochs.
Void Events as Synthetic Noises.
Masking Strategy & Input Encoding
Temporal Lower Triangular Attention
Pre-training Scheme
Fine-tuning
Experiments
Baselines.
...and 20 more sections

Key Result

Proposition 1

Transformers with combined PE and TE are universal approximators for any continuous sequence-to-sequence function with compact domain, i.e., they approximate any continuous function $f: \mathbf{X} \rightarrow \mathbf{H}$ to within $\epsilon$ error w.r.t. the $p$-norm, where $1 \le p < \infty$ and $\mathbf{X

Figures (5)

Figure 1: An example of real-world decentralized finance transactions from 3 users where their time-stamped actions are events displayed using colored markers. The 6 types of events are: borrow (red), repay (yellow), liquidation (green), deposit (blue), redeem (purple) and swap (pink).
Figure 2: Pre-training and fine-tuning with Event-former. In pre-training, void events are first sampled and inserted into an event sequence $A$, randomly masked, then embedded (with combined positional and temporal encoding) and fed into a transformer network ($B$ blocks of attention). Pre-training is done by minimizing prediction error (Eq. \ref{eqn:sample}) and contrastive loss (Eq. \ref{eqn:contrastive}). The learned event representations $\hat{H}_i$ for sequence $S$ are used for fine-tuning with a small feed-forward neural network by minimizing prediction loss (Eq. \ref{eqn:fine}).
Figure 3: t-SNE projection of learned representations of Hawkes-Exp and PGEM streams with pre-training on models A, B, C, D, E and F (Hawkes-Exp only) together.
Figure 4: The effect of combined TE + PE in training.

Theorems & Definitions (1)

Proposition 1
