Discrete Event, Continuous Time RNNs

Michael C. Mozer, Denis Kazakov, Robert V. Lindsey

TL;DR

This work investigates how to process event sequences with continuous timestamps by introducing a continuous-time GRU (CT-GRU) that allocates memory across multiple fixed time scales. Despite the added temporal structure and interpretability of the CT-GRU, experiments across eleven datasets show performance essentially identical to a standard GRU with $\Delta t$ inputs, indicating robustness of GRU/LSTM approaches to time in practice. The study highlights that explicit time-scale dynamics do not confer a clear advantage on these tasks, though the CT-GRU provides valuable insights into how memory might be organized across time. The results emphasize the value of inductive biases in time-aware architectures and offer guidance for future exploration of multiscale temporal representations.

Abstract

We investigate recurrent neural network architectures for event-sequence processing. Event sequences, characterized by discrete observations stamped with continuous-valued times of occurrence, are challenging due to the potentially wide dynamic range of relevant time scales as well as interactions between time scales. We describe four forms of inductive bias that should benefit architectures for event sequences: temporal locality, position and scale homogeneity, and scale interdependence. We extend the popular gated recurrent unit (GRU) architecture to incorporate these biases via intrinsic temporal dynamics, obtaining a continuous-time GRU. The CT-GRU arises by interpreting the gates of a GRU as selecting a time scale of memory, and the CT-GRU generalizes the GRU by incorporating multiple time scales of memory and performing context-dependent selection of time scales for information storage and retrieval. Event time-stamps drive decay dynamics of the CT-GRU, whereas they serve as generic additional inputs to the GRU. Despite the very different manner in which the two models consider time, their performance on eleven data sets we examined is essentially identical. Our surprising results point both to the robustness of GRU and LSTM architectures for handling continuous time, and to the potency of incorporating continuous dynamics into neural architectures.
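The two treatments of time contrasted in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the four log-spaced time scales, and the purely exponential decay rule are all hypothetical choices made for the example, and the gating and time-scale selection of the CT-GRU are omitted entirely.

```python
import numpy as np

# Hypothetical fixed time scales, log-spaced (illustrative values only,
# not taken from the paper): 1, 10, 100, 1000 time units.
TIME_SCALES = np.logspace(0, 3, num=4)


def decay_multiscale_memory(memory, delta_t, time_scales=TIME_SCALES):
    """Decay each memory trace exponentially over an elapsed interval.

    memory: array of shape (n_scales, n_hidden), one trace per time scale.
    delta_t: non-negative time elapsed since the previous event.
    Assumes plain exponential decay exp(-delta_t / tau) per scale.
    """
    decay = np.exp(-delta_t / time_scales)  # shape (n_scales,)
    return memory * decay[:, None]          # broadcast over hidden units


def append_delta_t(event_features, delta_t):
    """Baseline treatment: pass delta_t to the network as one more input."""
    return np.concatenate([event_features, [delta_t]])


if __name__ == "__main__":
    memory = np.ones((len(TIME_SCALES), 2))
    decayed = decay_multiscale_memory(memory, delta_t=10.0)
    # Traces at short time scales fade quickly; long-scale traces persist.
    print(decayed[:, 0])
    # The baseline simply sees the elapsed time as an extra feature.
    print(append_delta_t(np.array([0.0, 1.0]), 10.0))
```

In the first function the elapsed time acts on the memory directly, so no parameters are needed to express forgetting. In the second, the network has to learn from data what an elapsed interval should do to its state.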

Paper Structure

This paper contains 17 sections, 2 equations, 5 figures.

Figures (5)

  • Figure 1: A schematic of the GRU (left) and CT-GRU (right). Color coding of the elements matches the background color used in the tables presenting activation dynamics. For the CT-GRU, the large rectangle with segments represents a multiscale hidden representation. The intrinsic temporal decay of this representation, as well as the recurrent self-connections, is not depicted in the schematic.
  • Figure 2: (a) Half life for a range of time scales: true value (dashed black line) and mixture approximation (blue line). (b) Decay curves for time scales $\tau \in [10, 100]$ (solid lines) and the mixture approximation (dashed lines).
  • Figure 3: Working memory task: (a) CT-GRU (blue) and GRU (orange) responses to a probe on sequences like $\{\text{\sffamily \scshape l}/0, \text{\sffamily x}/0, \ldots, \text{\sffamily x}/t \}$ for a range of $t$. (b) Storage time scales, $\log_{10}({\bm{\tau}}_k^{S})$, and event-detection weights, ${\bm{W}}^{Q}$. The CT-GRU modulates the storage time scale of a symbol based on context.
  • Figure 4: Event sequences for (a) Cluster, (b) Hawkes process, and (c) Reddit. Time is on horizontal axis. Color denotes event label; in (a), irrelevant labels are rendered as dashed black lines.
  • Figure 5: Comparison of GRU, CT-GRU, and variants. Data sets (a)-(i) consist of at least 10k training and test examples and thus a single train/test split is adequate for evaluation. Smaller data set (j) is tested via 8-fold cross validation. Solid black lines represent a reference baseline performance level, and dashed lines indicate optimal performance (where known).