TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting

Sravan Kumar Ankireddy, Nikita Seleznev, Nam H. Nguyen, Yulun Wu, Senthil Kumar, Furong Huang, C. Bayan Bruss

TL;DR

TimeSqueeze is a dynamic patching mechanism that adaptively selects patch boundaries within each sequence based on local signal complexity. It consistently outperforms comparable architectures that use either point-wise tokenization or fixed-size patching.

Abstract

Transformer-based time series foundation models face a fundamental trade-off in the choice of tokenization: point-wise embeddings preserve temporal fidelity but scale poorly with sequence length, whereas fixed-length patching improves efficiency but imposes uniform boundaries that may disrupt natural transitions and blur informative local dynamics. To address these limitations, we introduce TimeSqueeze, a dynamic patching mechanism that adaptively selects patch boundaries within each sequence based on local signal complexity. TimeSqueeze first applies a lightweight state-space encoder to extract full-resolution point-wise features, then performs content-aware segmentation, allocating short patches to information-dense regions and long patches to smooth or redundant segments. This variable-resolution compression preserves critical temporal structure while substantially reducing the length of the token sequence presented to the Transformer backbone. In large-scale pretraining, TimeSqueeze attains up to 20x faster convergence and 8x higher data efficiency than equivalent point-token baselines. Experiments across long-horizon forecasting benchmarks show that TimeSqueeze consistently outperforms comparable architectures that use either point-wise tokenization or fixed-size patching.
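The idea of allocating short patches to information-dense regions and long patches to smooth ones can be illustrated with a minimal sketch. The abstract does not specify the boundary rule, so the one below is purely an assumption for illustration: a new patch starts whenever the accumulated absolute change inside the current patch would exceed a `budget`, or when the patch reaches `max_len`. The function names and both parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def dynamic_patch_starts(x, budget, max_len):
    """Return the start index of each patch.

    Illustrative rule (an assumption, not the paper's method): open a new
    patch when the accumulated |x[t] - x[t-1]| inside the current patch
    would exceed `budget`, or when the patch already holds `max_len` points.
    """
    x = np.asarray(x, dtype=float)
    starts = [0]
    acc = 0.0  # variation accumulated inside the current patch
    for t in range(1, len(x)):
        step = abs(x[t] - x[t - 1])
        length = t - starts[-1]  # points already in the current patch
        if acc + step > budget or length >= max_len:
            starts.append(t)  # point t opens a new patch
            acc = 0.0
        else:
            acc += step
    return starts

def patch_lengths(starts, n):
    """Convert patch start indices into patch lengths for a series of length n."""
    return np.diff(np.append(starts, n)).tolist()

# A flat segment, a rapidly oscillating burst, then another flat segment.
series = np.concatenate([np.zeros(8), [3.0, -3.0, 3.0, -3.0], np.zeros(8)])
starts = dynamic_patch_starts(series, budget=1.0, max_len=8)
lengths = patch_lengths(starts, len(series))
```

On this toy series the patch lengths come out as `[8, 1, 1, 1, 1, 8]`: the two flat segments are each covered by one long patch, while every point in the burst gets its own patch, so 20 points are reduced to 6 tokens.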

Paper Structure (33 sections, 6 equations, 14 figures, 10 tables)

Figures (14)

  • Figure 1: Architectural overview of the TimeSqueeze forecasting model. An SSM encoder first processes the raw series at full resolution to extract fine-grained features. Dynamic patching then adaptively compresses the sequence, selecting the salient subset of features. A Transformer backbone performs contextual modeling on the downsampled features, and an unpatching module upsamples the signal to the original resolution while preserving causality. Finally, an SSM decoder combines the compressed and fine-grained features and passes the resulting hybrid features to multi-horizon heads, improving efficiency without sacrificing temporal fidelity.
  • Figure 2: Computational efficiency comparison between TimeSqueeze$_{\text{base}}$ and Time-MoE: (a) Training memory and time requirements across different batch sizes and context lengths. TimeSqueeze achieves comparable performance while reducing memory usage by $3.4\times$ and training time by $\approx 20\times$. (b) Inference throughput across prediction horizons. TimeSqueeze delivers up to $10.5\times$ higher throughput for longer prediction horizons.
  • Figure 3: Ablation: (a) Average MSE across five benchmark datasets for prediction horizon 96 with different model components. (b) Pretraining Context Length vs Forecasting Performance: Longer pretraining context translates to improved performance, even when the inference context remains fixed at 512.
  • Figure 4: Performance scaling with training data size: Average MSE for 96-horizon forecasting across five benchmarks shows consistent improvement with increased training tokens.
  • Figure 5: Performance scaling with training data size: Average MSE for 96-horizon forecasting across five benchmarks shows consistent improvement with increased training tokens.
  • ...and 9 more figures
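The unpatching step described in the Figure 1 caption (upsampling back to the original resolution while preserving causality) can be sketched as follows. This is a minimal illustration under a stated assumption, not the paper's implementation: each backbone token is assumed to correspond to the feature selected at the first time step of its patch, so it depends only on inputs up to that step. Under that assumption, copying a token to every position from its patch start up to the next patch start never exposes a position to later inputs. The function name `causal_unpatch` is hypothetical.

```python
import numpy as np

def causal_unpatch(tokens, starts, n):
    """Upsample per-patch tokens to n time steps.

    Position t receives the token of the most recent patch start <= t.
    Assumption (not confirmed by the source): token k summarizes inputs
    up to and including starts[k] only, which makes this copy causal.
    """
    tokens = np.asarray(tokens)
    starts = np.asarray(starts)
    # Index of the last patch start that is <= t, for every t in [0, n).
    idx = np.searchsorted(starts, np.arange(n), side="right") - 1
    return tokens[idx]

# Three patches starting at steps 0, 3 and 4 of a length-6 series.
tokens = np.array([[1.0], [2.0], [3.0]])
upsampled = causal_unpatch(tokens, starts=[0, 3, 4], n=6)
```

Here `upsampled` has shape `(6, 1)`, with the first token repeated over steps 0 to 2, the second used at step 3, and the third over steps 4 and 5.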