The impact of memory on learning sequence-to-sequence tasks

Alireza Seif, Sarah A. M. Loos, Gennaro Tucci, Édgar Roldán, Sebastian Goldt

TL;DR

The paper proposes the stochastic switching-Ornstein–Uhlenbeck (SSOU) model, a simple model for a seq2seq task that provides explicit control over the degree of memory, or non-Markovianity, in the sequences. It also introduces a measure of non-Markovianity that quantifies the amount of memory in the sequences.

Abstract

The recent success of neural networks in natural language processing has drawn renewed attention to learning sequence-to-sequence (seq2seq) tasks. While there exists a rich literature that studies classification and regression tasks using solvable models of neural networks, seq2seq tasks have not yet been studied from this perspective. Here, we propose a simple model for a seq2seq task that has the advantage of providing explicit control over the degree of memory, or non-Markovianity, in the sequences -- the stochastic switching-Ornstein-Uhlenbeck (SSOU) model. We introduce a measure of non-Markovianity to quantify the amount of memory in the sequences. For a minimal auto-regressive (AR) learning model trained on this task, we identify two learning regimes corresponding to distinct phases in the stationary state of the SSOU process. These phases emerge from the interplay between two different time scales that govern the sequence statistics. Moreover, we observe that while increasing the integration window of the AR model always improves performance, albeit with diminishing returns, increasing the non-Markovianity of the input sequences can improve or degrade its performance. Finally, we perform experiments with recurrent and convolutional neural networks that show that our observations carry over to more complicated neural network architectures.
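
As a rough illustration of the task described in the abstract, the sketch below generates one SSOU-like sequence and fits a linear AR($W$) readout to it. Several ingredients are assumptions, because the defining equations are not reproduced on this page: the particle is taken to follow the overdamped dynamics $\dot X_t = -\kappa (X_t - C_t) + \sqrt{2D}\,\xi_t$, the trap switches between $\pm 1$ after Gamma-distributed waiting times with shape $k$ and mean 1, and the AR weights are fitted by ordinary least squares rather than by the mini-batch training used in the paper. The function names (`simulate_ssou`, `fit_ar`, `ar_error`) are hypothetical.

```python
import numpy as np

def simulate_ssou(kappa=10.0, D=0.5, k=1, c0=1.0, dt=0.02, n_steps=5000, seed=0):
    """Euler-Maruyama integration of an assumed SSOU equation of motion."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps)
    c = np.empty(n_steps)
    x_t = 0.0
    c_t = c0 if rng.random() < 0.5 else -c0
    # Assumption: waiting times are Gamma(shape=k, scale=1/k), so the mean is 1
    # and k=1 reduces to an exponential (memoryless) distribution.
    t_left = rng.gamma(shape=k, scale=1.0 / k)
    noise_scale = np.sqrt(2.0 * D * dt)
    for i in range(n_steps):
        x[i], c[i] = x_t, c_t
        # Drift towards the current trap centre plus thermal noise.
        x_t += -kappa * (x_t - c_t) * dt + noise_scale * rng.standard_normal()
        t_left -= dt
        while t_left <= 0.0:  # flip the trap when the waiting time has elapsed
            c_t = -c_t
            t_left += rng.gamma(shape=k, scale=1.0 / k)
    return x, c

def window_features(x, W):
    """Rows are [x_{t-W+1}, ..., x_t, 1] for every t with a full window."""
    n = len(x) - W + 1
    cols = [x[j:j + n] for j in range(W)]
    return np.column_stack(cols + [np.ones(n)])

def fit_ar(x, c, W=2):
    """Least-squares fit of a linear AR(W) readout (a stand-in for SGD training)."""
    w, *_ = np.linalg.lstsq(window_features(x, W), c[W - 1:], rcond=None)
    return w

def ar_error(w, x, c, W=2):
    """Fraction of time steps where the sign of the prediction misses the trap."""
    pred = np.sign(window_features(x, W) @ w)
    return float(np.mean(pred != np.sign(c[W - 1:])))

x_train, c_train = simulate_ssou(seed=0)
x_test, c_test = simulate_ssou(seed=1)
weights = fit_ar(x_train, c_train, W=2)
error = ar_error(weights, x_test, c_test, W=2)
print(f"AR(2) test error: {error:.3f}")
```

Changing `k` in `simulate_ssou` varies the memory of the sequence while keeping the mean waiting time fixed, and changing `W` varies the integration window of the readout, which are the two knobs discussed in the abstract.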

Paper Structure

This paper contains 34 sections, 19 equations, and 10 figures.

Figures (10)

  • Figure 1: A flexible, minimal model for sequence-to-sequence learning tasks with varying degrees of memory along the sequence. Left: The motion of a Brownian particle (black filled circle) in a switching parabolic potential "trap" yields the dynamics of the SSOU (stochastic switching Ornstein–Uhlenbeck) model, \ref{eq:eom-x}. Example trajectories of the particle's position $X_t$ (black line) and the trap center $C_t$ (blue line) are shown as functions of time $t$. The blue dashed lines mark the trap centers $C_0=\pm 1$. Middle: We train three types of models to reconstruct the trap positions $C_t$ from the particle trajectories $X_t$: auto-regressive (AR) models that predict the trap position from the past $W$ observations of the particle's position; 1D convolutional neural networks (CNN) that act as a set of $f$ parallel AR models with an added nonlinearity in the output; and recurrent neural networks (RNN) that use feedback loops with a $d$-dimensional internal state to capture dependencies across time steps. Right: Sample reconstruction of the trap positions $\hat{y}_t$ (red dashed line) for one example input sequence ($X_t$, black solid line), compared with the actual hidden trap position ($C_t$, blue solid line). The sequence is a zoomed-in view of the sample sequence in the left panel. Parameters: $\kappa=10$, $k=1$, $D=0.5$, simulation time step $\Delta t=0.02$, AR model window size $W=2$. For this example, we trained on 5000 samples and evaluated on 5000 test samples, with a mini-batch size of 32 and 5 epochs.
  • Figure 2: Quantifying the memory of non-Markovian input sequences. Left: waiting-time distribution $\psi_k(\tau)$, given by \ref{eq:waitingtimes}, for three choices of $k$. Right: measure $M(t)$ (see \ref{eq:non-markovianity}) for the parameters in the left panel, which quantifies the memory of the past time $t_1$ in the sequence. Choosing $k=1$ yields an exponential waiting-time distribution, \ref{eq:waitingtimes}, and hence a memoryless sequence $X_t$ with $M(t)=0$ (red curves). As $k$ increases, the memory becomes stronger (blue and green curves). Parameters: $D=0.5$, $\kappa=2$, $\Omega_1=\Omega_2=[0.5,1.5]$, $t_3-t_2=0.5$; results are obtained by generating $N=10^5$ trajectories via the Euler–Maruyama method \cite{kloeden1992stochastic} with simulation time step $\Delta t=5\times 10^{-3}$.
  • Figure 3: Performance of auto-regressive AR(2) models for Markovian and non-Markovian sequences of the SSOU model across its "learnability phase diagram". Reconstruction error of the AR(2) model (in % of correctly predicted trap positions) for different values of the diffusive time scale $t_{\rm diff} = 1/(2D)$ and the relaxation time $t_\kappa = 1/\kappa$, see \ref{eq:time scales}. We show these "phase diagrams" for (Left) Markovian sequences ($k=1$) and (Right) non-Markovian sequences ($k=5$). In these diagrams, the two axes represent ratios of time scales, $t_{\rm diff}/t_\kappa$ (vertical axis) vs. $t_\kappa/\langle \tau \rangle_k = 1/\kappa$ (horizontal axis), since $\langle \tau \rangle_k=1$ for all $k$. The bottom row depicts the distribution of the particle's position $p(x_t)$ along the dashed line superimposed on the error heat map. The density shows a clear transition from a bimodal distribution to a unimodal one. Parameters: total simulation time for each parameter value $\tau=30$, simulation time step $\Delta t=0.01$, AR($W$) window size $W=2$, $\kappa$ varying from 0.1 to 0.6, $\tau_h=2$, remaining training parameters as in \ref{fig:dynamics}.
  • Figure 4: Statistical characterisation of the error. (Left) Scatter plot of the error of several AR(2) models trained on Markovian sequences versus the Sarle coefficient, a measure of the bimodality of the distribution $p(x_t)$, \ref{eq:sarle}. (Right) Scatter plot of the error of several AR(2) models trained on Markovian sequences versus the excess kurtosis, \ref{eq:excesskurt}, another measure of the bimodality of the distribution. Parameters: Similar to those used in the right panel of \ref{fig:eq-phase-diagram}.
  • Figure 5: The impact of sequence memory on recurrent and convolutional neural networks. (Left) Analytical predictions for the covariance of the particle and trap position, \ref{eq:correlations}, in the non-Markovian case as a function of $k$. As we increase the non-Markovianity, correlations between $x_t$ and $c_t$ decay, complicating the reconstruction task. (Right) Prediction error $\epsilon$ for students with different architectures: auto-regressive (AR), convolutional (CNN) and gated recurrent neural network (GRU). The "memory units" of the architectures correspond to the size of the kernel ($W$) in the AR and the first layer of CNN models, and the number of units ($d$) in the hidden state of GRU models. The CNN models all have $f=10$ filters in their first layer. Parameters: Similar to \ref{fig:eq-phase-diagram} with $\kappa=2$.
  • ...and 5 more figures
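
The caption of Figure 4 relates the AR(2) error to two bimodality statistics of $p(x_t)$. The paper's exact definitions (its Sarle and excess-kurtosis equations) are not reproduced on this page, so the sketch below assumes the standard population forms: the excess kurtosis $m_4/m_2^2 - 3$ and Sarle's coefficient $(\gamma^2 + 1)/(m_4/m_2^2)$, where $\gamma$ is the skewness and $m_n$ are central moments. The function names are hypothetical. Under these definitions a Gaussian gives a Sarle coefficient of $1/3$ and a uniform distribution gives $5/9$, so values above $5/9$ are commonly read as a hint of bimodality.

```python
import numpy as np

def central_moment(x, order):
    """Sample estimate of the central moment of the given order."""
    x = np.asarray(x, dtype=float)
    return float(np.mean((x - x.mean()) ** order))

def excess_kurtosis(x):
    """Kurtosis minus 3, so that a Gaussian sample gives roughly 0."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2 - 3.0

def sarle_coefficient(x):
    """Assumed population form of Sarle's bimodality coefficient."""
    skewness = central_moment(x, 3) / central_moment(x, 2) ** 1.5
    kurtosis = excess_kurtosis(x) + 3.0
    return (skewness ** 2 + 1.0) / kurtosis

rng = np.random.default_rng(0)
# Unimodal sample: a single Gaussian.
unimodal = rng.standard_normal(200_000)
# Bimodal sample: positions concentrated around two trap centres at -1 and +1.
bimodal = np.concatenate([rng.normal(-1.0, 0.2, 100_000),
                          rng.normal(+1.0, 0.2, 100_000)])

for name, sample in [("unimodal", unimodal), ("bimodal", bimodal)]:
    print(name, sarle_coefficient(sample), excess_kurtosis(sample))
```

Applied to simulated particle positions, the same two functions would give one point per parameter setting in a scatter plot like those described for Figure 4.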