Understanding Transformers for Time Series: Rank Structure, Flow-of-ranks, and Compressibility

Annan Yu, Danielle C. Maddix, Boran Han, Xiyuan Zhang, Abdul Fatir Ansari, Oleksandr Shchur, Christos Faloutsos, Andrew Gordon Wilson, Michael W. Mahoney, Yuyang Wang

TL;DR

This paper analyzes Transformers for time series through a rank-structure lens, revealing that time-series embeddings have sharply decaying singular values, which enable accurate low-rank approximations of the $Q/K/V$ projections and compressible attention. It introduces the flow-of-ranks concept to explain how depth and nonlinearities increase rank, and proves formal guarantees on attention compressibility for low-rank inputs while showing incompressibility for high-rank inputs. The authors develop a principled framework that guides design choices for width, depth, and heads, and demonstrate practical compression on Chronos and Chronos-Bolt, achieving substantial speedups and memory reductions with minimal or no loss in accuracy. They also compare pretrained vs. pretraining-compressed models, showing that a compressed, purpose-built TSFM can outperform traditional local methods and expand the time–accuracy Pareto frontier. Overall, the work provides both theoretical insights and actionable techniques for exploiting compressibility in time-series foundation models, with broad implications for efficient deployment in data-scarce regimes.

Abstract

Transformers are widely used across data modalities, and yet the principles distilled from text models often transfer imperfectly to models trained on other modalities. In this paper, we analyze Transformers through the lens of rank structure. Our focus is on the time series setting, where the structural properties of the data differ remarkably from those of text or vision. We show that time-series embeddings, unlike text or vision, exhibit sharply decaying singular value spectra: small patch sizes and smooth continuous mappings concentrate the data into low-rank subspaces. From this, we prove that the associated $Q/K/V$ projections admit accurate low-rank approximations, and that attention layers become compressible in proportion to the decay of the embedding spectrum. We introduce the concept of flow-of-ranks, a phenomenon by which nonlinear mixing across depth inflates the rank, explaining why early layers are most amenable to compression and why ranks grow with depth. Guided by these theoretical and empirical results, we compress Chronos, a large time series foundation model, achieving a reduction of $65\%$ in inference time and $81\%$ in memory, without loss of accuracy. Our findings provide principled guidance for allocating width, depth, and heads in time series foundation models, and for exploiting their inherent compressibility.
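The link between a decaying embedding spectrum and a compressible projection can be illustrated numerically. The sketch below is a toy setup, not the paper's experiment: `smooth_embedding` (a tanh of random affine maps), the relative threshold in `eps_rank`, and the random matrix standing in for $\mathbf{W}_Q$ are all assumptions made for illustration. It embeds scalars from $[-1,1]$ with a smooth map, compares the resulting numerical rank with that of an i.i.d. Gaussian matrix, and then replaces the projection by a rank-$r$ surrogate that agrees with it on the dominant right-singular subspace of the input.

```python
import numpy as np

def smooth_embedding(x, d, seed=0):
    """Hypothetical smooth map phi: [-1, 1] -> R^d (tanh of random affine maps)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=d)
    b = rng.normal(size=d)
    return np.tanh(np.outer(x, a) + b)  # shape (L, d)

def eps_rank(M, eps=1e-3):
    """Number of singular values above eps * sigma_1 (one common convention, assumed here)."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > eps * s[0]))

def compress_projection(X, W, r):
    """Rank-r surrogate P @ W, where P projects onto the top-r right-singular vectors of X."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:r].T @ Vt[:r]  # (d, d) orthogonal projector of rank r
    return P @ W

L, d = 512, 256
x = np.linspace(-1.0, 1.0, L)
X_ts = smooth_embedding(x, d)                              # smooth, low numerical rank
X_rand = np.random.default_rng(1).normal(size=(L, d))      # unstructured baseline
W_q = np.random.default_rng(2).normal(size=(d, d)) / np.sqrt(d)  # stand-in for W_Q

rank_ts = eps_rank(X_ts)
rank_rand = eps_rank(X_rand)

# Compress the projection down to the numerical rank of the smooth input.
W_r = compress_projection(X_ts, W_q, rank_ts)
rel_err = np.linalg.norm(X_ts @ W_q - X_ts @ W_r) / np.linalg.norm(X_ts @ W_q)

print(f"eps-rank (smooth embedding): {rank_ts}")
print(f"eps-rank (Gaussian matrix):  {rank_rand}")
print(f"relative error of rank-{rank_ts} projection: {rel_err:.2e}")
```

The surrogate only needs to match the original projection on the subspace the inputs actually occupy, so its error is controlled by the discarded singular values of the input rather than by the spectrum of the projection itself.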

Paper Structure

This paper contains 28 sections, 8 theorems, 103 equations, 16 figures, 4 tables.

Key Result

Theorem 1

Given any hidden dimension $d > 1$, let $\boldsymbol{\phi}: [-1,1] \rightarrow \mathbb{R}^d$ be a function that embeds $[-1,1]$ into $\mathbb{R}^d$. Given $L$ arbitrary points $x_1, \ldots, x_L$ sampled from $[-1,1]$, define Let $s_1 \geq \cdots \geq s_d \geq 0$ and $\sigma_1 \geq \cdots \geq \sigma_d \geq 0$ be the singular values of the quasimatrix $\boldsymbol{\Xi}$ and matrix $\boldsymbol{\Ps

Figures (16)

  • Figure 1: Overview of our results. We show that the embedded inputs of Transformers trained with time-series data have much lower ranks than those of other modalities, including Vision Transformers and Transformers trained with language data (see \ref{sec:embedding}); we prove that attention matrices on low-rank inputs are well-approximated by low-rank matrices (see \ref{sec:inputs_to_attention}); and we introduce and demonstrate a concept called flow-of-ranks, describing how attention matrices in earlier layers are more compressible than those in later layers (see \ref{sec:flowofranks}).
  • Figure 2: (a): Singular values of the embedded input matrices from many different TSFMs, a TFM, a ViT, and an LLM. (b,c): Embedding space of Chronos and a T5 LLM, respectively, visualized by projecting them onto the leading two singular vectors of the embedding matrix.
  • Figure 3: (a): Singular values of the embedded input matrices from Chronos-Bolt models pretrained with different patch sizes $k$. (b): Angles between Chronos-Bolt's embedded vectors in $\mathbb{R}^d = \mathbb{R}^{768}$ defined in \ref{eq.angles}, where the patches $\mathbf{x}^{(i)}$ are from a sinusoidal wave and Gaussian white noise, respectively. We also plot the angles between i.i.d. random Gaussian vectors in $\mathbb{R}^k = \mathbb{R}^{16}$ and $\mathbb{R}^d = \mathbb{R}^{768}$ for comparison.
  • Figure 4: The averaged $\varepsilon$-rank of query projection matrices $\mathbf{W}_Q$ in pretrained Chronos models and T5 LLMs. In (a,b), we vary the hidden dimension $d$. The light blue curves are contours of the ratio between the horizontal and the vertical axes in the semilog-x scale. In (c), we fix the hidden dimension $d = 512$ and change the rank of the fixed input embedding $\boldsymbol{\Xi}$ (see \ref{sec:chebyshev}).
  • Figure 5: (a,b): The $\varepsilon$-rank of every query projection $\mathbf{W}_Q$ in the encoder of a Chronos model and a T5 model of the same size, respectively. (c): Singular values of the input matrix to each of Chronos's encoder layers, starting with a constant signal $(x_1, \ldots, x_T) = (0, \ldots, 0)$.
  • ...and 11 more figures
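
The flow-of-ranks idea summarized in Figure 1 (rank growing as nonlinear layers are stacked) can be reproduced qualitatively on a toy model. The sketch below is an illustration under assumptions, not the architecture studied in the paper: `toy_block` is a hypothetical residual block with a ReLU and random weights, and `eps_rank` uses a relative threshold on singular values as an assumed convention. It starts from an exactly rank-4 input and tracks the numerical rank after each block.

```python
import numpy as np

def eps_rank(M, eps=1e-3):
    """Number of singular values above eps * sigma_1 (assumed convention)."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > eps * s[0]))

def toy_block(X, rng):
    """Hypothetical residual block X + relu(X W1) W2 with random weights."""
    d = X.shape[1]
    W1 = rng.normal(size=(d, d)) / np.sqrt(d)
    W2 = rng.normal(size=(d, d)) / np.sqrt(d)
    return X + np.maximum(X @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
L, d, r0 = 256, 64, 4

# Exactly rank-r0 input: product of (L, r0) and (r0, d) Gaussian factors.
X = rng.normal(size=(L, r0)) @ rng.normal(size=(r0, d))

ranks = [eps_rank(X)]
for _ in range(6):
    X = toy_block(X, rng)
    ranks.append(eps_rank(X))

print("eps-rank per depth:", ranks)
```

A purely linear stack could never raise the rank above 4; the increase here comes from the pointwise nonlinearity, which is consistent with the explanation that early layers see the lowest-rank inputs and are therefore the most compressible.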

Theorems & Definitions (16)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Corollary 1
  • proof : Proof of \ref{thm.embedding}
  • Corollary 2
  • proof
  • proof : Proof of \ref{thm.boltembedding}
  • Theorem 5
  • ...and 6 more