Preventing Conflicting Gradients in Neural Marked Temporal Point Processes

Tanguy Bosser, Souhaib Ben Taieb

TL;DR

This work identifies a fundamental issue in neural marked temporal point processes: when time and mark predictions share parameters, training can yield conflicting gradients that hamper learning. It introduces disjoint parametrizations that separate time and mark modeling (via distinct time/mark encoders and decoupled decoders), enabling independent optimization while preserving dependency through a time-conditioned mark distribution. Empirical results across five real-world datasets show consistent gains in both time and mark predictions, with particularly strong improvements in mark-prediction calibration and accuracy. The framework yields better-calibrated predictive distributions and demonstrates that preventing gradient conflicts at the root can outperform gradient-surgery-based fixes, with broader implications for multi-task learning in sequential models.
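As a sketch of what "time-conditioned mark distribution" means, the standard factorization of a marked point process can be written as below. The symbols $\tau$ (inter-arrival time), $k$ (mark) and $\mathcal{H}_t$ (event history) are assumed here for illustration and do not appear in the summary above; the per-event loss split is likewise an assumption about how the two tasks are defined.

$$
f(\tau, k \mid \mathcal{H}_t) = f(\tau \mid \mathcal{H}_t)\, p(k \mid \tau, \mathcal{H}_t),
\qquad
\mathcal{L} = \underbrace{-\log f(\tau \mid \mathcal{H}_t)}_{\mathcal{L}_T} \; \underbrace{-\,\log p(k \mid \tau, \mathcal{H}_t)}_{\mathcal{L}_M}.
$$

Under this reading, a disjoint parametrization would make $\mathcal{L}_T$ depend only on $\boldsymbol{\theta}_T$ and $\mathcal{L}_M$ only on $\boldsymbol{\theta}_M$, while the mark term still receives $\tau$ as an input.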

Abstract

Neural Marked Temporal Point Processes (MTPP) are flexible models for capturing complex temporal inter-dependencies between labeled events. These models inherently learn two predictive distributions: one for the arrival times of events and another for the types of events, also known as marks. In this study, we demonstrate that learning an MTPP model can be framed as a two-task learning problem, where both tasks share a common set of trainable parameters that are optimized jointly. We show that this often leads to conflicting gradients during training, where the task-specific gradients point in opposite directions. When such conflicts arise, following the average gradient can be detrimental to the learning of each individual task, resulting in overall degraded performance. To overcome this issue, we introduce novel parametrizations for neural MTPP models that allow for separate modeling and training of each task, effectively avoiding the problem of conflicting gradients. Through experiments on multiple real-world event sequence datasets, we demonstrate the benefits of our framework compared to the original model formulations.
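The notion of conflicting gradients can be illustrated with a minimal sketch: compute the cosine of the angle between the two task gradients on the shared parameters and flag a conflict when it is negative. The toy quadratic losses and the helper names (`cosine`, `grad_quadratic`) are hypothetical stand-ins chosen for this example; they are not the paper's models or code.

```python
import math


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between two gradient vectors."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention for this sketch: a zero gradient never conflicts.
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


def grad_quadratic(theta, target):
    """Gradient of the toy loss 0.5 * ||theta - target||^2 w.r.t. theta."""
    return [t - c for t, c in zip(theta, target)]


# Shared parameters and two hypothetical task optima (assumptions).
theta = [0.0, 0.0]
time_optimum = [1.0, 0.5]   # minimiser of the toy "time" loss
mark_optimum = [-1.0, 0.5]  # minimiser of the toy "mark" loss

g_T = grad_quadratic(theta, time_optimum)  # [-1.0, -0.5]
g_M = grad_quadratic(theta, mark_optimum)  # [1.0, -0.5]

cos_phi_TM = cosine(g_T, g_M)      # approximately -0.6
is_conflicting = cos_phi_TM < 0.0  # True: the gradients conflict

print(f"cos(phi_TM) = {cos_phi_TM:.3f}, conflicting = {is_conflicting}")
```

In a real model the two vectors would be the flattened gradients of the time and mark losses with respect to the shared parameters, obtained from whatever automatic differentiation framework is in use.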

Paper Structure

This paper contains 29 sections, 1 theorem, 27 equations, 15 figures, 10 tables.

Key Result

Corollary 1

Assume that $\mathcal{L}_T$ and $\mathcal{L}_M$ are differentiable, and that the learning rate $\alpha$ is sufficiently small. If $\text{cos } \phi_{TM} < 0$, then $\mathcal{L}(\{\boldsymbol{\theta}_T^{s+1}, \boldsymbol{\theta}_M^{s+1}\}) < \mathcal{L}(\boldsymbol{\theta}^{s+1})$.
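A first-order sketch of why such an inequality is plausible is given below. This is a reconstruction under assumptions, not the paper's proof. It assumes $\mathcal{L} = \mathcal{L}_T + \mathcal{L}_M$, writes $g_T = \nabla \mathcal{L}_T(\boldsymbol{\theta}^{s})$ and $g_M = \nabla \mathcal{L}_M(\boldsymbol{\theta}^{s})$ (notation introduced here), takes the joint update to be $\boldsymbol{\theta}^{s+1} = \boldsymbol{\theta}^{s} - \alpha (g_T + g_M)$, and takes the disjoint updates to be $\boldsymbol{\theta}_T^{s+1} = \boldsymbol{\theta}^{s} - \alpha g_T$ and $\boldsymbol{\theta}_M^{s+1} = \boldsymbol{\theta}^{s} - \alpha g_M$, with each task loss evaluated only at its own parameters. Terms of order $\alpha^2$ are dropped.

$$
\begin{aligned}
\mathcal{L}(\boldsymbol{\theta}^{s+1}) &\approx \mathcal{L}(\boldsymbol{\theta}^{s}) - \alpha \lVert g_T + g_M \rVert^2 \\
&= \mathcal{L}(\boldsymbol{\theta}^{s}) - \alpha \left( \lVert g_T \rVert^2 + \lVert g_M \rVert^2 \right) - 2\alpha \lVert g_T \rVert \lVert g_M \rVert \, \text{cos } \phi_{TM}, \\
\mathcal{L}(\{\boldsymbol{\theta}_T^{s+1}, \boldsymbol{\theta}_M^{s+1}\}) &\approx \mathcal{L}(\boldsymbol{\theta}^{s}) - \alpha \left( \lVert g_T \rVert^2 + \lVert g_M \rVert^2 \right), \\
\mathcal{L}(\boldsymbol{\theta}^{s+1}) - \mathcal{L}(\{\boldsymbol{\theta}_T^{s+1}, \boldsymbol{\theta}_M^{s+1}\}) &\approx -2\alpha \lVert g_T \rVert \lVert g_M \rVert \, \text{cos } \phi_{TM}.
\end{aligned}
$$

Under these assumptions the last line is positive exactly when $\text{cos } \phi_{TM} < 0$, which matches the direction of the stated inequality.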

Figures (15)

  • Figure 1: Conflicting gradients
  • Figure 2: Distribution of $\text{cos } \phi_{TM}$ during training for the different baselines on MOOC and LastFM. CG refers to the proportion of $\text{cos } \phi_{TM} < 0$ observed during training. The distribution is obtained by pooling the values of $\phi_{TM}$ over 5 training runs, and gradients that are conflicting correspond to the red bars.
  • Figure 3: Graphical representation of the base, "+", and "++" setups.
  • Figure 4: Validation curves of the $\mathcal{L}_T$ and $\mathcal{L}_M$ components for SAHP++ on MOOC.
  • Figure 5: Distribution of $\text{cos } \phi_{TM}$ during training at the encoder (ENC) and decoder (DEC) heads for THP, SAHP and FNN in the base and base+ setup on LastFM and MOOC. "B" and "+" refer to the base and base+ models, respectively, and the distribution is obtained by pooling the values of $\phi_{TM}$ over 5 training runs. As the decoders are disjoint in the base+ setting, note that $\text{cos } \phi_{TM}$ is not defined.
  • ...and 10 more figures

Theorems & Definitions (2)

  • Corollary 1
  • proof