Tuning the burn-in phase in training recurrent neural networks improves their performance
Julian D. Schiller, Malte Heinrich, Victor G. Lopez, Matthias A. Müller
TL;DR
This work analyzes truncated backpropagation through time (TBPTT) for training recurrent neural networks on long sequences and identifies the length $m$ of the burn-in phase as a critical hyperparameter that governs transient dynamics and performance. By viewing TBPTT through an optimization and turnpike lens, the authors derive regret and accuracy bounds that depend on $m$, the overlap between subsequences, and a stability factor $\lambda \in (0,1)$. They show that properly tuning $m$ can tightly bound the deviation from a benchmark trained on full sequences, and they validate these insights with synthetic and real-world time-series datasets, reducing training and test error by more than 60% in some cases. The findings offer a principled approach to selecting TBPTT window and burn-in lengths, with practical implications for system identification and forecasting tasks where computational efficiency and generalization are critical.
Abstract
Training recurrent neural networks (RNNs) with standard backpropagation through time (BPTT) can be challenging, especially in the presence of long input sequences. A practical alternative to reduce computational and memory overhead is to perform BPTT repeatedly over shorter segments of the training data set, corresponding to truncated BPTT. In this paper, we examine the training of RNNs when using such a truncated learning approach for time series tasks. Specifically, we establish theoretical bounds on the accuracy and performance loss when optimizing over subsequences instead of the full data sequence. This reveals that the burn-in phase of the RNN is an important tuning knob in its training, with significant impact on the performance guarantees. We validate our theoretical results through experiments on standard benchmarks from the fields of system identification and time series forecasting. In all experiments, we observe a strong influence of the burn-in phase on the training process, and proper tuning can lead to a reduction of the prediction error on the training and test data of more than 60% in some cases.
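To make the role of the burn-in phase concrete, the following is a minimal sketch, not the authors' implementation. It splits a long sequence into overlapping subsequences and scores the prediction error only after the first `burn_in` steps of each one. The helper names (`make_windows`, `window_loss`, `simulate`), the scalar tanh RNN, the zero initial hidden state per subsequence, and the choice of stride are all illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    start: int        # first index of the subsequence in the full sequence
    burn_in_end: int  # first index whose prediction error is scored
    end: int          # one past the last index of the subsequence


def make_windows(seq_len, window_len, burn_in, stride=None):
    """Split range(seq_len) into subsequences with a burn-in prefix.

    Assumption: by default the stride equals window_len - burn_in, so the
    scored parts of consecutive subsequences do not overlap.
    """
    if not 0 <= burn_in < window_len:
        raise ValueError("burn_in must satisfy 0 <= burn_in < window_len")
    if stride is None:
        stride = window_len - burn_in
    if stride <= 0:
        raise ValueError("stride must be positive")
    windows = []
    start = 0
    while start + window_len <= seq_len:
        windows.append(Window(start, start + burn_in, start + window_len))
        start += stride
    return windows


def simulate(a, b, inputs, h0):
    """Hypothetical scalar RNN: h_next = tanh(a * h + b * u)."""
    h, outputs = h0, []
    for u in inputs:
        h = math.tanh(a * h + b * u)
        outputs.append(h)
    return outputs


def window_loss(a, b, inputs, targets, w, h0=0.0):
    """Mean squared error on one subsequence, ignoring the burn-in steps."""
    h, total = h0, 0.0
    for t in range(w.start, w.end):
        h = math.tanh(a * h + b * inputs[t])
        if t >= w.burn_in_end:  # burn-in steps only warm up the hidden state
            total += (h - targets[t]) ** 2
    return total / (w.end - w.burn_in_end)


# Toy data generated by the same model, started from a nonzero hidden state.
a_true, b_true = 0.5, 1.0
inputs = [math.sin(0.3 * t) for t in range(100)]
targets = simulate(a_true, b_true, inputs, h0=0.9)


def mean_loss(burn_in):
    # Fixed stride so both settings use the same subsequences.
    ws = make_windows(len(inputs), window_len=20, burn_in=burn_in, stride=10)
    return sum(window_loss(a_true, b_true, inputs, targets, w) for w in ws) / len(ws)


loss_without_burn_in = mean_loss(0)
loss_with_burn_in = mean_loss(10)
print(loss_without_burn_in, loss_with_burn_in)
```

In this toy setting the loss is evaluated at the true parameters, so any remaining error comes only from restarting the hidden state at zero in each subsequence. Because `|a_true| < 1` makes the state update a contraction, that error shrinks geometrically over the burn-in steps, which is the kind of transient effect the burn-in phase is meant to absorb.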
