The Few Govern the Many: Unveiling Few-Layer Dominance for Time Series Models
Xin Qiu, Junlong Tong, Yirong Sun, Yunpu Ma, Xiaoyu Shen
TL;DR
The paper identifies a scaling paradox in large-scale time series models: enlarging model capacity does not consistently improve forecasting in either LLM4TS or TSFMs. It uncovers a few-layer dominance phenomenon, in which a small subset of layers drives performance while the rest are redundant or even counterproductive. The authors introduce a principled Critical Layer Identification method and demonstrate that retaining about 20–30% of layers can match or exceed full-model accuracy while achieving substantial inference speedups (up to $2.7\times$). The approach generalizes across eight SOTA TS models and 13 datasets, suggesting a practical path to more efficient and scalable time series forecasting systems without sacrificing accuracy.
Abstract
Large-scale models are at the forefront of time series (TS) forecasting, dominated by two paradigms: fine-tuning text-based Large Language Models (LLM4TS) and training Time Series Foundation Models (TSFMs) from scratch. Both approaches share a foundational assumption that scaling up model capacity and data volume leads to improved performance. However, we observe a scaling paradox in TS models: larger models do not achieve better performance. Through extensive experiments on two model families across four scales (100M to 1.7B parameters) and diverse data (up to 6B observations), we rigorously confirm that the scaling paradox is a pervasive issue. We then diagnose its root cause by analyzing internal representations, identifying a phenomenon we call few-layer dominance: only a small subset of layers is functionally important, while the majority are redundant, under-utilized, and can even interfere with training. Based on this discovery, we propose a practical method to automatically identify and retain only these dominant layers. In our models, retaining only 21% of the parameters achieves up to a 12% accuracy improvement and a $2.7\times$ inference speedup. We validate the universality of our method on 8 prominent SOTA models (LLM4TS and TSFMs, 90M to 6B), showing that retaining less than 30% of layers achieves comparable or superior accuracy in over 95% of tasks.
