Online Data Augmentation for Forecasting with Deep Learning
Vitor Cerqueira, Moisés Santos, Luis Roque, Yassine Baghoussi, Carlos Soares
TL;DR
This work tackles forecasting with multiple univariate time series under data scarcity by introducing online data augmentation: a model-agnostic framework that generates synthetic samples on the fly during training. By pairing the original samples in each batch with corresponding synthetic variants, it maintains a balanced representation of real and synthetic data and avoids storing large augmented datasets. Empirical results across six datasets, three neural architectures, and seven generation methods show that online augmentation often yields better forecasting accuracy than offline augmentation or no augmentation. The authors provide a public, extensible framework to reproduce and extend these findings, with potential for adaptive and multivariate extensions in future work.
Abstract
Deep learning approaches are increasingly used to tackle forecasting tasks involving datasets with multiple univariate time series. A key factor in the successful application of these methods is a sufficiently large training sample, which is not always available. Synthetic data generation techniques can be applied in these scenarios to augment the dataset. Data augmentation is typically applied offline, before training a model. However, when training with mini-batches, some batches may contain a disproportionate number of synthetic samples that do not align well with the characteristics of the original data. This work introduces an online data augmentation framework that generates synthetic samples during the training of neural networks. By creating synthetic samples for each batch alongside their original counterparts, we maintain a balanced representation of real and synthetic data throughout the training process. This approach fits naturally with the iterative nature of neural network training and eliminates the need to store large augmented datasets. We validated the proposed framework using 3797 time series from six benchmark datasets, three neural architectures, and seven synthetic data generation techniques. The experiments suggest that online data augmentation leads to better forecasting performance than offline data augmentation or no augmentation. The framework and experiments are publicly available.
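The per-batch pairing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `jitter` and `online_augmented_batches` are hypothetical, and Gaussian jittering is only a stand-in assumption for whichever of the seven generation techniques is used. The sketch shows the key property: every mini-batch holds each sampled original window together with exactly one synthetic variant created at batch time, so nothing augmented is stored between epochs.

```python
import random


def jitter(window, scale=0.05, rng=random):
    """Hypothetical stand-in generator: add Gaussian noise to each value."""
    return [x + rng.gauss(0.0, scale) for x in window]


def online_augmented_batches(windows, batch_size, generator=jitter, seed=0):
    """Yield mini-batches of originals plus freshly generated synthetic twins.

    Synthetic samples are created when the batch is formed and discarded
    afterwards, so each batch is half original and half synthetic.
    """
    rng = random.Random(seed)
    order = list(range(len(windows)))
    rng.shuffle(order)  # shuffle the original windows once per epoch
    for start in range(0, len(order), batch_size):
        originals = [windows[i] for i in order[start:start + batch_size]]
        synthetic = [generator(w, rng=rng) for w in originals]
        yield originals + synthetic  # first half real, second half synthetic


# Toy data: 10 windows of length 4 (assumed shapes, for illustration only).
windows = [[float(i + j) for j in range(4)] for i in range(10)]
batches = list(online_augmented_batches(windows, batch_size=4))
```

In a real training loop, each yielded batch would be converted to tensors and passed to the forecasting model. By contrast, an offline approach would append synthetic windows to the dataset before shuffling, which is what allows individual batches to end up with a disproportionate share of synthetic samples.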
