Tuning the burn-in phase in training recurrent neural networks improves their performance

Julian D. Schiller; Malte Heinrich; Victor G. Lopez; Matthias A. Müller

TL;DR

This work analyzes truncated backpropagation through time (TBPTT) for training recurrent neural networks on long sequences and identifies the burn-in phase length $m$ as a critical hyperparameter that governs transient behavior and performance. Viewing TBPTT through an optimization and turnpike lens, the authors derive regret and accuracy bounds that depend on $m$, the overlap between subsequences, and a stability factor $\lambda \in (0,1)$. They show that properly tuning $m$ can tighten the bounds on the deviation from a benchmark trained on the full sequence, and they validate these insights on synthetic and real-world time-series data sets, reducing the prediction error on training and test data by more than 60% in some cases. The findings offer a principled approach to selecting TBPTT windowing and burn-in lengths, with practical implications for system identification and forecasting tasks where computational efficiency and generalization are critical.

Abstract

Training recurrent neural networks (RNNs) with standard backpropagation through time (BPTT) can be challenging, especially in the presence of long input sequences. A practical alternative to reduce computational and memory overhead is to perform BPTT repeatedly over shorter segments of the training data set, corresponding to truncated BPTT. In this paper, we examine the training of RNNs when using such a truncated learning approach for time series tasks. Specifically, we establish theoretical bounds on the accuracy and performance loss when optimizing over subsequences instead of the full data sequence. This reveals that the burn-in phase of the RNN is an important tuning knob in its training, with significant impact on the performance guarantees. We validate our theoretical results through experiments on standard benchmarks from the fields of system identification and time series forecasting. In all experiments, we observe a strong influence of the burn-in phase on the training process, and proper tuning can lead to a reduction of the prediction error on the training and test data of more than 60% in some cases.
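The effect of a burn-in phase can be illustrated with a minimal sketch. It is not the paper's loss or training algorithm: it assumes a scalar linear recurrence as a stand-in for an RNN, a zero initial state at the start of each subsequence, and a plain MSE that skips the first $m$ outputs. The names `simulate` and `burn_in_mse` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def simulate(a, b, u, x0=0.0):
    """Scalar recurrence x_{t+1} = a*x_t + b*u_t with output y_t = x_t (toy stand-in for an RNN)."""
    x = x0
    y = np.empty(len(u))
    for t, u_t in enumerate(u):
        y[t] = x
        x = a * x + b * u_t
    return y

def burn_in_mse(y_pred, y_true, m):
    """MSE over one subsequence that ignores the first m (burn-in) outputs (assumed loss form)."""
    if not 0 <= m < len(y_true):
        raise ValueError("burn-in length m must satisfy 0 <= m < N")
    return float(np.mean((y_pred[m:] - y_true[m:]) ** 2))

rng = np.random.default_rng(0)
a, b, N = 0.8, 1.0, 21               # |a| < 1: the toy model forgets its initial state
u = rng.standard_normal(200)
y_full = simulate(a, b, u, x0=5.0)   # "data" produced by one long run

# Take one subsequence and re-simulate it from a zero initial state,
# as happens when a subsequence is processed without the preceding history.
s = 50
u_seg, y_seg = u[s:s + N], y_full[s:s + N]
y_hat = simulate(a, b, u_seg, x0=0.0)

# Even with the correct parameters, the loss is nonzero because of the
# wrong initial state; the mismatch shrinks as more outputs are discarded.
losses = {m: burn_in_mse(y_hat, y_seg, m) for m in (0, 5, 10, 20)}
for m, loss in losses.items():
    print(f"m = {m:2d}: loss = {loss:.3e}")
```

In this toy setting the output error at step $j$ of the subsequence equals the initial-state mismatch scaled by $a^j$, so a larger $m$ removes more of the transient from the loss at the price of using fewer outputs per subsequence. The geometric decay here only loosely mirrors the role of a stability factor in $(0,1)$; the paper's actual bounds are not reproduced.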

Paper Structure

This paper contains 22 sections, 8 theorems, 45 equations, 6 figures, 2 tables, 1 algorithm.

Introduction
Recurrent neural networks
Training RNNs using TBPTT and mini-batch SGD
Data segmentation in TBPTT
Segment-wise loss function with burn-in phase
Basic training algorithm
Training performance
Regret guarantees for truncated learning
An optimization perspective
Main results
Experiments
Synthetic data
Data sets from system identification and time series forecasting
Conclusions
Theoretical analysis
...and 7 more sections

Key Result

Theorem 1

Let the TBPTT solution $\theta^*$ of the problem in \ref{eq:NLP_SGD} and the benchmark solution $\theta^\mathrm{b}$ of the problem in \ref{eq:NLP_benchmark} be given, and assume that both parameters render the respective RNN incrementally output stable in the sense of Assumption \ref{ass:output_stability}. Under an ad... Here, $\lambda\in(0,1)$ is from Assumption \ref{ass:output_stability}.

Figures (6)

Figure 1: General structure of an RNN (left) with its internal architecture unfolded in space and time (middle). The network consists of $L$ recurrent layers and processes an input sequence of length $T$. Black arrows denote instantaneous spatial processing, red arrows denote temporal processing. Each arrow represents learnable linear or nonlinear projections and/or layers, potentially with additional skip connections, which can generally be captured by the functions $f$ and $g$. The right figure shows an exemplary RNN for time series forecasting applications, producing a forecast of $F{\,=\,}3$ future values of a sequence $\{x_t\}$ using a lookback window of $N{\,=\,}4$.
Figure 2: Data segmentation in TBPTT learning. Depicted are the full data sequence $D$ of length $T$ (green) and four (out of $S$) subsequences of length $N$ (red). The subsequences $D_i$ and $D_{i-1}$ start at samples $s_i$ and $s_{i-1}$, respectively, and result in the overlap $o_i$, compare \ref{eq:overlap_i}. Blue highlighted segments appear in the loss function $L(\theta;D_i)$ in \ref{eq:loss}, which is determined by the burn-in phase length $m$.
Figure 3: Training results for the synthetic data experiment for different values of the burn-in phase $m$. Left: batch-averaged output error $e_j$. Right: performance $P$ (the truncated MSE from \ref{eq:performance}) achieved by solving the TBPTT problem \ref{eq:NLP_SGD} ('$\circ$' markers) and the benchmark problem \ref{eq:NLP_benchmark} ('$\times$' markers).
Figure 4: Training results. Left and middle: MSE of RNN predictions for training ('$\circ$') and test data ('$\times$') after (re-)training; gray lines correspond to the BPTT-trained RNN evaluated on the training (solid) and test data (dashed). Right: Averaged TBPTT training times relative to BPTT.
Figure 5: Supplementary results for the synthetic data experiment performed in Section \ref{sec:exp_synthetic}. The left column refers to the training data, the right column to the test data. The corresponding input-output time series are shown in the top row. In the middle row, we compare the outputs of the RNN trained with $m=0$ (without burn-in) in terms of the outputs ${y^*_{j|i}}$ (belonging to subsequences of the training data ${X_i^\mathrm{d}}$, $i\in\mathcal{S}$) and the outputs ${y^*_{t}}$ (generated by processing the entire input sequence ${X^\mathrm{d}}$) with the benchmark output ${y^\mathrm{b}_t}$ and the ground truth data ${y^\mathrm{d}_t}$. The bottom row shows the respective results when training the RNNs using the burn-in phase length $m=N-1=20$.
...and 1 more figure
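The segmentation described in the caption of Figure 2 can be sketched as follows. This is only an illustration under stated assumptions: subsequences start at a fixed stride, the overlap $o_i$ is taken to be the number of samples shared by $D_{i-1}$ and $D_i$, and the loss of a subsequence uses its last $N-m$ samples. The paper's exact definitions of the overlap and the loss are not reproduced here, and the helper names are hypothetical.

```python
def segment_starts(T, N, stride):
    """Start indices s_i of length-N subsequences taken every `stride` samples (assumed scheme)."""
    if N > T or stride < 1:
        raise ValueError("need N <= T and stride >= 1")
    return list(range(0, T - N + 1, stride))

def overlaps(starts, N):
    """Assumed overlap o_i: number of samples shared by consecutive subsequences."""
    return [max(0, starts[i - 1] + N - starts[i]) for i in range(1, len(starts))]

def loss_indices(s_i, N, m):
    """Global sample indices entering the loss of the subsequence starting at s_i,
    assuming the first m samples are used for burn-in only."""
    if not 0 <= m < N:
        raise ValueError("burn-in length m must satisfy 0 <= m < N")
    return list(range(s_i + m, s_i + N))

T, N, stride, m = 20, 8, 5, 3
starts = segment_starts(T, N, stride)
print("starts:  ", starts)
print("overlaps:", overlaps(starts, N))
for s_i in starts:
    print(f"s_i = {s_i:2d}: loss over samples {loss_indices(s_i, N, m)}")
```

With a stride smaller than $N$, consecutive subsequences overlap, so samples that are skipped during the burn-in of one subsequence can still enter the loss of the previous one.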

Theorems & Definitions (20)

Remark 1
Theorem 1: informal
Remark 2
Theorem 2: informal
Remark 3
Lemma 1: Stabilizability
proof
Lemma 2
proof
Lemma 3: Coercivity
...and 10 more
