Inter-Series Transformer: Attending to Products in Time Series Forecasting

Rares Cristian; Pavithra Harsha; Clemente Ocejo; Georgia Perakis; Brian Quanz; Ioannis Spantidakis; Hamza Zerhouni

TL;DR

The paper tackles demand forecasting in supply-chain settings, where sparsity and cross-series effects hinder both traditional models and standard Transformers. It proposes the Inter-Series Transformer, which first applies a cross-series attention layer to inform the target time series and then passes the result to a shared, multi-task per-series Transformer, so the model captures both cross-series interactions and per-series temporal structure. Experiments on a private medical-device dataset and two Walmart retail datasets show that the approach often matches or outperforms baselines and state-of-the-art Transformer forecasters. Ablations highlight the value of projecting features to a high-dimensional representation and of replacing positional encoding with explicit date features. The work targets practical demand forecasting by addressing sparsity, overfitting, and cross-series effects, and it supports its claims with interpretable attention patterns and cross-validation analyses.

Abstract

Time series forecasting is an important task in many fields, ranging from supply chain management to weather forecasting. Recently, Transformer neural network architectures have shown promising results in forecasting on common time series benchmark datasets. However, their application to supply chain demand forecasting, which can have challenging characteristics such as sparsity and cross-series effects, has been limited. In this work, we explore the application of Transformer-based models to supply chain demand forecasting. In particular, we develop a new Transformer-based forecasting approach that uses a shared, multi-task per-time-series network with an initial component applying attention across time series, to capture interactions and help address sparsity. We provide a case study applying our approach to improve demand prediction for a medical device manufacturing company. To further validate our approach, we also apply it to public demand forecasting datasets and demonstrate competitive or superior performance compared to a variety of baseline and state-of-the-art forecasting methods across the private and public datasets.
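
The following NumPy sketch illustrates the general idea of attending across series before per-series modeling. It is a minimal illustration under stated assumptions, not the authors' implementation: the function name `inter_series_features`, the single-head dot-product scoring, the projection matrices `W_q`, `W_k`, and the value vector `w_v` are all hypothetical. Each series' history window is treated as one token, attention weights are computed between the target series and all series, and the attended signal is concatenated with the target's own features along the last dimension.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def inter_series_features(P, X_q, q, W_q, W_k, w_v):
    """Hypothetical single-head attention across series.

    P   : (N, T) history of all N target series over T time steps
    X_q : (T, F) feature matrix of the target series q
    W_q, W_k : (T, d) assumed query/key projections of a history window
    w_v : (d_v,) assumed projection lifting the mixed signal to d_v channels
    Returns the (T, F + d_v) combined input and the (N,) attention weights.
    """
    query = P[q] @ W_q                      # (d,)   query from the target series
    keys = P @ W_k                          # (N, d) one key per series
    scores = keys @ query / np.sqrt(keys.shape[-1])
    weights = softmax(scores)               # (N,)   non-negative, sums to 1
    mixed = weights @ P                     # (T,)   weighted mix of all series
    x_is = mixed[:, None] * w_v[None, :]    # (T, d_v) attended representation
    combined = np.concatenate([X_q, x_is], axis=-1)  # concat on last dimension
    return combined, weights


# Toy usage with synthetic, sparse-looking count data (illustrative only).
rng = np.random.default_rng(0)
N, T, F, d, d_v = 5, 12, 3, 8, 4
P = rng.poisson(0.5, size=(N, T)).astype(float)
X_q = rng.normal(size=(T, F))
combined, weights = inter_series_features(
    P, X_q, q=2,
    W_q=rng.normal(size=(T, d)),
    W_k=rng.normal(size=(T, d)),
    w_v=rng.normal(size=d_v),
)
print(combined.shape, weights.round(3))
```

In a full model the combined matrix would feed the shared per-series Transformer, and the projections would be learned jointly with it. A sparse series could then borrow signal from denser, related series through the learned weights.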

Paper Structure (22 sections, 7 equations, 6 figures, 6 tables)


Introduction
Related Work
Traditional time series models
RNN and CNN-based time series models
Transformer-based time series models
Key model differences with our approach
Experiment differences with our approach
Methodology
Problem Definition
Preliminaries
Model Architecture
Attention layers
Projection to High Dimensional Representation
Abandonment of Positional Encoding
Experimental Setup
...and 7 more sections

Figures (6)

Figure 1: Inter-Series Transformer diagram with Inter-Series Attention, illustrated here with a single encoder block and a single decoder block. Inputs include $\mathbf{P}$, the matrix containing all target time series; $\mathbf{P}_q$, the target time series of product $q$; $\mathbf{X}_q$, the feature matrix of product $q$; and $\mathbf{X}^{IS}$, the output of the Inter-Series Attention layer. The circled plus symbol indicates concatenation along the last dimension.
Figure 2: Attention weights learned between products / time series of type 1 - for one prediction. Each row shows the attention weights for each other series across the columns, for that target series. Lighter color indicates a higher value.
Figure 3: Example of high-volume products. Products 1, 4, 10 and 41 receive most of the attention weight from sparser time series, as shown in Figure 2.
Figure 4: Example of low-volume and sparse products. Many time series, such as products 6 and 45, depend highly on products 1, 4, 10 and 41, as shown in Figure 2.
Figure 5: Comparison of value distribution for sparse (products 6 and 45) and high-volume (products 4 and 10) time series.
...and 1 more figures
