Understanding Token-level Topological Structures in Transformer-based Time Series Forecasting

Jianqi Zhang; Wenwen Qiang; Jingyao Wang; Jiahuan Zhou; Changwen Zheng; Hui Xiong

TL;DR

This work identifies that Transformer-based time series forecasting models progressively distort original token-level topology as layers deepen, limiting predictive accuracy. It introduces the Topology Enhancement Method (TEM), a plug-and-play framework with two modules: PTEM to preserve the original positional topology and STEM to preserve semantic topology, both guided by a bi-level optimization strategy. The authors provide theoretical generalization bounds showing that maintaining topology tightens the bound and corroborate these findings with extensive experiments across multiple datasets and TSF baselines, where TEM yields consistent performance gains. The approach offers a practical, adaptable pathway to enhance Transformer TSF models without altering core architectures, with code released for reproducibility and broader applicability.

Abstract

Transformer-based methods have achieved state-of-the-art performance in time series forecasting (TSF) by capturing positional and semantic topological relationships among input tokens. However, it remains unclear whether existing Transformers fully leverage the intrinsic topological structure among tokens throughout intermediate layers. Through empirical and theoretical analyses, we identify that current Transformer architectures progressively degrade the original positional and semantic topology of input tokens as the network deepens, thus limiting forecasting accuracy. Furthermore, our theoretical results demonstrate that explicitly enforcing preservation of these topological structures within intermediate layers can tighten generalization bounds, leading to improved forecasting performance. Motivated by these insights, we propose the Topology Enhancement Method (TEM), a novel Transformer-based TSF method that explicitly and adaptively preserves token-level topology. TEM consists of two core modules: 1) the Positional Topology Enhancement Module (PTEM), which injects learnable positional constraints to explicitly retain original positional topology; 2) the Semantic Topology Enhancement Module (STEM), which incorporates a learnable similarity matrix to preserve original semantic topology. To determine optimal injection weights adaptively, TEM employs a bi-level optimization strategy. The proposed TEM is a plug-and-play method that can be integrated with existing Transformer-based TSF methods. Extensive experiments demonstrate that integrating TEM with a variety of existing methods significantly improves their predictive performance, validating the effectiveness of explicitly preserving original token-level topology. Our code is publicly available at: https://github.com/jlu-phyComputer/TEM.
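
To make the idea of injecting positional and semantic topology into an intermediate layer more concrete, the following is a minimal NumPy sketch of a single attention head. It is an illustration under stated assumptions, not the authors' implementation: the function name `topology_aware_attention`, the additive form of the injection, and the use of scalar weights `gamma_q`, `gamma_k`, `gamma_v`, and `xi` (loosely mirroring the $\Gamma_Q$/$\Gamma_K$/$\Gamma_V$/$\Xi$ symbols used in the figure captions) are all hypothetical. In the paper these weights are learned through bi-level optimization; here they are fixed constants.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def topology_aware_attention(z, pe, sim, w_q, w_k, w_v,
                             gamma_q, gamma_k, gamma_v, xi):
    """Single-head attention with hypothetical topology injection.

    z   : (n, d) token features entering the layer
    pe  : (n, d) positional encoding of the ORIGINAL input tokens
    sim : (n, n) similarity matrix of the ORIGINAL input tokens
    gamma_*, xi : injection weights (assumed scalar in this sketch)
    """
    # Assumed PTEM-style step: re-inject the original positional encoding
    # into the inputs of the query, key, and value projections.
    q = (z + gamma_q * pe) @ w_q
    k = (z + gamma_k * pe) @ w_k
    v = (z + gamma_v * pe) @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Assumed STEM-style step: bias the attention logits with the
    # similarity matrix of the original input tokens.
    attn = softmax(scores + xi * sim, axis=-1)
    return attn @ v, attn

rng = np.random.default_rng(0)
n, d = 6, 4
z = rng.normal(size=(n, d))
pe = rng.normal(size=(n, d))
x0 = rng.normal(size=(n, d))   # stand-in for the original input tokens
sim = x0 @ x0.T                # dot-product similarity (an assumption)
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out, attn = topology_aware_attention(z, pe, sim, w_q, w_k, w_v,
                                     gamma_q=0.5, gamma_k=0.5,
                                     gamma_v=0.5, xi=0.1)
print(out.shape, attn.sum(axis=-1))
```

Setting all four weights to zero recovers standard scaled dot-product attention, which is one way to read the plug-and-play claim: the enhancement adds terms to an existing layer rather than replacing it.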

Paper Structure (50 sections, 2 theorems, 47 equations, 17 figures, 15 tables, 1 algorithm)

Introduction
Related Work
Problem Analysis and Motivation
Definition and Notation
Empirical Findings
Theoretical Analysis
Method
Positional Topology Enhancement Module
Semantic Topology Enhancement Module
Overall Optimization
Experiments
Experimental Settings
Comparative Experimental Results
Ablation Study
Ablation Study of PTEM, STEM
...and 35 more sections

Key Result

Theorem 3.4

Let $\mathcal{X}$ and $\mathcal{Y}$ denote the input and output spaces for TSF, and let $\mathcal{H}$ be a class of Transformer-based functions $h: \mathcal{X} \rightarrow \mathcal{Y}$. Assume a non-negative, $\xi$-Lipschitz loss function $\ell: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}$ where $\delta' = \delta - 4(\mu - 1)\beta(a)$, $\mathfrak{R}_\mu(\cdot)$ is the Rademacher complexi

Figures (17)

Figure 1: (a) Illustration of the change in the value of HSIC between the PE and the output feature of each encoder layer. (b) Illustration of the change in the value of HSIC between the similarity matrix of input tokens and the similarity matrix of output tokens of each encoder layer. Changes in model performance after enhancing positional topology of the input tokens (c) or semantic topology of the input tokens (d) in deep layers. EPT/EST stands for "enhanced positional/semantic topology of the input tokens".
Figure 2: (a), (b), (c): Illustration of the change in the value of HSIC between the similarity matrix of input tokens and the similarity matrix of output tokens of each encoder layer. (d), (e), (f): Changes in model performance after enhancing the semantic topology of the input tokens. EST stands for "enhanced semantic topology of the input tokens". Each row from left to right shows the results of the baseline methods using RoPE [su2024roformer], CPE [chu2021conditional], and GL-PE [lv2025toward].
Figure 3: Overview of the Topology Enhancement Method (TEM). (a) Architecture diagram of the current popular Transformer-based TSF methods. (b) Illustration of the integration locations of PTEM and STEM within Transformer-based TSF methods.
Figure 4: Illustration of our model’s performance under different fixed values of $\Gamma_Q$/$\Gamma_K$/$\Gamma_V$/$\Xi$ across various datasets (ETTm1, Weather, and ECL) and prediction lengths (96 and 720). The algorithm adopts PatchTST+TEM. The first point on each curve represents the performance of using the adaptive weight. The dashed lines, drawn horizontally through the first points of each curve, are intended to facilitate comparison between the performance of using the adaptive weight and that under other fixed-value settings.
Figure 5: Illustration of our model’s performance under different fixed values of $\Gamma_Q$/$\Gamma_K$/$\Gamma_V$/$\Xi$ across various datasets (ETTm1, Weather, and ECL) and prediction lengths (96 and 720). The algorithm adopts iTransformer+TEM. The setting is the same as Figure 4.
...and 12 more figures
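
The captions of Figures 1 and 2 rely on the Hilbert-Schmidt Independence Criterion (HSIC) to track how much of the input tokens' topology survives in each encoder layer. As a rough illustration, the sketch below computes the standard biased empirical HSIC estimate, $\operatorname{tr}(KHLH)/(n-1)^2$ with centering matrix $H = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top$, between two token similarity matrices. The choice of cosine similarity, the decision to use the similarity matrices directly as kernel matrices, and the synthetic "layer output" are assumptions made for this example; the paper may use a different kernel or normalization.

```python
import numpy as np

def hsic(k, l):
    # Biased empirical HSIC estimate from two (n, n) kernel matrices.
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return float(np.trace(k @ h @ l @ h)) / (n - 1) ** 2

def cosine_similarity_matrix(tokens):
    # Pairwise cosine similarity between the rows of `tokens`.
    # This is a Gram matrix of unit vectors, hence a valid kernel matrix.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)
    return unit @ unit.T

rng = np.random.default_rng(0)
tokens_in = rng.normal(size=(16, 8))                      # input tokens
tokens_out = tokens_in + 0.1 * rng.normal(size=(16, 8))   # synthetic layer output

s_in = cosine_similarity_matrix(tokens_in)
s_out = cosine_similarity_matrix(tokens_out)

score_same = hsic(s_in, s_in)     # dependence of the input topology on itself
score_layer = hsic(s_in, s_out)   # dependence retained after the "layer"
print(score_same, score_layer)
```

Under this reading, a value of `score_layer` that shrinks from layer to layer would correspond to the progressive loss of semantic topology that the figures report.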

Theorems & Definitions (9)

Definition 3.1: Positional Topology
Definition 3.2: Semantic Topology
Definition 3.3: Original Topological Structure among Input Tokens of Transformer-based TSF
Theorem 3.4
Definition 3.5: Layer-wise Positional Topology Distortion
Definition 3.6: Layer-wise Semantic Topology Distortion
Theorem 3.7
proof
proof
