Enhancing Masked Time-Series Modeling via Dropping Patches

Tianyu Qiu, Yi Xie, Yun Xiong, Hao Niu, Xiaofeng Gao

TL;DR

DropPatch introduces random sub-sequence patch dropping before masking to improve masked time-series pre-training. By dropping patches with ratio $r$ prior to masking and reconstructing only the masked patches among those that remain, the method sharpens attention, reduces redundancy, and acts as data augmentation that mitigates over-fitting. Empirical results across 12 real datasets and synthesized corpora show consistent gains in in-domain and cross-domain forecasting, along with improved training efficiency; theoretical analysis shows slower convergence to a rank-1 representation in Transformer layers. Attention- and representation-level analyses (KL divergence of attention, head diversity, and CKA) support the mechanism behind the performance gains, suggesting DropPatch as a practical augmentation for time-series foundation models, with broad applicability to domain-adaptation and low-data regimes.

Abstract

This paper explores how to enhance existing masked time-series modeling by randomly dropping sub-sequence level patches of time series. On this basis, a simple yet effective method named DropPatch is proposed, which has two remarkable advantages: 1) it improves the pre-training efficiency by a square-level advantage; 2) it provides additional advantages for modeling in scenarios such as in-domain, cross-domain, few-shot learning and cold start. This paper conducts comprehensive experiments to verify the effectiveness of the method and analyze its internal mechanism. Empirically, DropPatch strengthens the attention mechanism, reduces information redundancy and serves as an efficient means of data augmentation. Theoretically, it is proved that DropPatch, by randomly dropping patches, slows down the rate at which the Transformer representations collapse into the rank-1 linear subspace, thus improving the quality of the learned representations.
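The drop-then-mask order described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `drop_then_mask` and `masked_mse`, the use of a zero vector in place of a mask token, and the rounding of patch counts are all assumptions made for illustration. The default ratios (drop 0.6, mask 0.4) are the ones the paper reports using.

```python
import numpy as np

def drop_then_mask(patches, drop_ratio=0.6, mask_ratio=0.4, rng=None):
    """Drop a random subset of patches, then mask a subset of the survivors.

    patches: array of shape (num_patches, patch_len).
    Returns (model_input, target, keep_idx, mask), where mask is a boolean
    vector over the kept patches. Hypothetical helper, not the paper's code.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = patches.shape[0]

    # Step 1: drop. Dropped patches are removed entirely and never reconstructed.
    n_keep = max(1, int(round(n * (1.0 - drop_ratio))))
    keep_idx = np.sort(rng.choice(n, size=n_keep, replace=False))
    target = patches[keep_idx]

    # Step 2: mask. Only the remaining patches are candidates for masking.
    n_mask = int(round(n_keep * mask_ratio))
    mask = np.zeros(n_keep, dtype=bool)
    mask[rng.choice(n_keep, size=n_mask, replace=False)] = True

    model_input = target.copy()
    model_input[mask] = 0.0  # assumption: zeros stand in for a mask token
    return model_input, target, keep_idx, mask

def masked_mse(pred, target, mask):
    """Reconstruction loss computed on the masked patches only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

# Toy usage: 40 non-overlapping patches of length 12 from a synthetic series.
series = np.arange(480, dtype=float)
patches = series.reshape(40, 12)
model_input, target, keep_idx, mask = drop_then_mask(
    patches, rng=np.random.default_rng(0)
)
print(model_input.shape, int(mask.sum()))  # 16 patches kept, 6 of them masked
```

Because the encoder in this sketch would see 16 patches instead of 40, the attention cost per layer shrinks accordingly. `keep_idx` is returned so that a model could, if desired, retain the original positions of the surviving patches.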

Paper Structure

This paper contains 42 sections, 4 theorems, 30 equations, 7 figures, 13 tables.

Key Result

Lemma 1

Let $\mathrm{SAN}$ denote a self-attention layer, and consider stacking $L$ such layers. Then, under certain conditions, the representations within the stacked self-attention layers will converge to a rank-1 matrix as $L \to \infty$.
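The rank-1 collapse in Lemma 1 can be observed numerically in a toy setting. The sketch below is an illustration under strong simplifying assumptions, not the paper's construction: each layer is pure softmax attention with identity query, key, and value projections, with no residual connections or feed-forward blocks, and the helper names are hypothetical. It tracks the relative distance between the token matrix and the closest matrix whose rows are all identical, which is a rank-1 matrix.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def rank1_residual(x):
    """Relative Frobenius distance from x to the matrix whose rows all
    equal the mean row of x (the closest matrix with identical rows)."""
    res = x - x.mean(axis=0, keepdims=True)
    return float(np.linalg.norm(res) / np.linalg.norm(x))

def pure_attention_stack(x, num_layers):
    """Apply num_layers of weight-free self-attention and record the
    residual before the first layer and after every layer."""
    residuals = [rank1_residual(x)]
    d = x.shape[1]
    for _ in range(num_layers):
        attn = softmax(x @ x.T / np.sqrt(d))  # row-stochastic attention matrix
        x = attn @ x                          # each token becomes a convex mix
        residuals.append(rank1_residual(x))
    return residuals

rng = np.random.default_rng(0)
tokens = 0.5 * rng.standard_normal((6, 8))  # 6 tokens, dimension 8
residuals = pure_attention_stack(tokens, num_layers=10)
print(residuals[0], residuals[-1])
```

In this toy stack every layer replaces each token with a convex combination of all tokens, so the rows are pulled toward a common vector and the residual shrinks toward zero. The sketch does not model patch dropping, so it shows the collapse itself rather than how DropPatch slows it.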

Figures (7)

  • Figure 1: (A) The loss curve of PatchTST with a lower mask ratio of 0.4 (official implementation); (B) The loss curve of DropPatch (unless otherwise stated, the drop ratio and mask ratio are 0.6 and 0.4, respectively, throughout this paper); (C) The Kullback-Leibler (KL) divergence between the attention coefficients of the final encoder layer and a uniform distribution, where each dot represents an individual attention head. A larger KL divergence indicates that the attention distribution of that head is farther from uniform and thus more focused. PatchTST(0.78) refers to PatchTST configured with a mask ratio of 0.78, matching the number of visible patches in DropPatch. (D) Comparison of MSE between PatchTST and DropPatch with forecasting steps $T \in \{96, 720\}$ on ETTm1.
  • Figure 2: The overall pre-training framework of DropPatch.
  • Figure 3: Analysis of (A) normalized distance, and (B) KL divergence between attention distributions and uniform distribution for each head across all layers. Each dot represents an individual attention head, while different colors indicate different layers.
  • Figure 4: Attention distribution difference across attention heads in the last layer.
  • Figure 5: Models are pre-trained on the ECL dataset and subsequently fine-tuned on ECL (in-domain) and on the ETTh1 and Weather (cross-domain) datasets.
  • ...and 2 more figures
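The per-head KL divergence used in Figures 1(C) and 3(B) can be computed along the following lines. This is a sketch under assumptions: the direction KL(attention || uniform), the averaging over query positions to obtain one value per head, and the function names are choices made here for illustration, since the captions do not specify them.

```python
import numpy as np

def kl_to_uniform(attn, eps=1e-12):
    """KL(attn || uniform) for each attention row.

    attn: array of shape (..., num_keys) whose last axis sums to 1.
    Returns an array of shape (...,). The value is 0 for uniform attention
    and log(num_keys) for one-hot attention.
    """
    n = attn.shape[-1]
    log_p = np.log(np.clip(attn, eps, 1.0))  # clip to avoid log(0)
    # KL(p || u) = sum_k p_k * (log p_k - log(1/n))
    return np.sum(attn * (log_p + np.log(n)), axis=-1)

def head_focus(attn):
    """One score per head: mean KL over query positions (an assumption).

    attn: array of shape (num_heads, num_queries, num_keys).
    """
    return kl_to_uniform(attn).mean(axis=-1)

# Toy usage: one perfectly uniform head and one perfectly focused head.
uniform_head = np.full((4, 4), 0.25)
focused_head = np.eye(4)
scores = head_focus(np.stack([uniform_head, focused_head]))
print(scores)  # first entry is 0, second is log(4)
```

A head whose score is near zero spreads its attention evenly over all patches, while a larger score means the head concentrates on a few patches, which is the sense of "more focused" in the captions.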

Theorems & Definitions (6)

  • Lemma 1
  • Corollary 1
  • Lemma 2
  • proof
  • Corollary 2
  • proof