One-for-All: A Lightweight Stabilized and Parameter-Efficient Pre-trained LLM for Time Series Forecasting

Prasanjit Dey, Soumyabrata Dev, Bianca Schoen-Phelan

Abstract

We address the challenge of adapting pre-trained Large Language Models (LLMs) for multivariate time-series analysis, where their deployment is often hindered by prohibitive computational and memory demands. Our solution, One-for-All, introduces Gaussian Rank-Stabilized Low-Rank Adapters (rsLoRA) to enable parameter-efficient fine-tuning of frozen LLMs. While inspired by LoRA, rsLoRA introduces a mathematically grounded rank-stabilization mechanism that enables provable gradient stability at low ranks, a novel contribution absent from prior PEFT methods. Our framework injects trainable rank decomposition matrices (rank 16) into positional embeddings and output layers, while keeping self-attention weights fixed. This design reduces trainable parameters by 6.8$\times$ (vs. TimesNet), 21$\times$ (vs. GPT4TS), and 11.8$\times$ (vs. TIME-LLM), while achieving a 168-1,776$\times$ smaller memory footprint (2.2MiB vs. 340MiB-4.18GiB in SOTA models). Rigorous evaluation across six time-series tasks demonstrates that One-for-All achieves state-of-the-art efficiency-accuracy trade-offs: 5.5$\times$ higher parameter efficiency (MSE=5.50) than TimesNet and 21$\times$ better than GPT4TS, while matching their forecasting accuracy (MSE=0.33). The framework's stability is validated through consistent performance across diverse horizons (96-720 steps) and datasets (ETT, Weather, M3, M4), with 98.3% fewer parameters than conventional transformers. These advances enable deployment on edge devices for healthcare, finance, and environmental monitoring without compromising performance.
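The mechanism the abstract describes, freezing the pre-trained backbone and injecting rank-16 low-rank adapters with Gaussian initialization and a rank-stabilized scaling, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes PyTorch, the names `RSLoRALinear` and `alpha` are hypothetical, and the only departure from vanilla LoRA is the $\alpha/\sqrt{r}$ scaling.

```python
# Minimal rsLoRA sketch (illustrative, not the paper's code). A frozen base
# layer is augmented with a rank-r adapter beta_r * Y X, where Y starts at
# zero and X has i.i.d. Gaussian entries, matching the setup of Theorem 1.
import math
import torch
import torch.nn as nn

class RSLoRALinear(nn.Module):  # hypothetical name
    """Frozen linear layer plus rank-stabilized adapter: W x + beta_r * Y(X x)."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # keep pre-trained weights fixed
            p.requires_grad = False
        d_out, d_in = base.out_features, base.in_features
        # X: Gaussian entries (variance 1/d_in here, a common choice);
        # Y: zeros, so the adapted layer starts identical to the frozen base.
        self.X = nn.Parameter(torch.randn(r, d_in) / math.sqrt(d_in))
        self.Y = nn.Parameter(torch.zeros(d_out, r))
        # Rank-stabilized scaling alpha / sqrt(r), vs. vanilla LoRA's alpha / r.
        self.beta_r = alpha / math.sqrt(r)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.beta_r * ((x @ self.X.T) @ self.Y.T)

# Usage: adapt only an output head; attention weights stay frozen.
head = RSLoRALinear(nn.Linear(768, 96), r=16)
y = head(torch.randn(4, 768))  # (batch, horizon)
```

Wrapping a positional-embedding table would be analogous; each adapted layer adds only $r(d + d^\prime)$ trainable parameters, which is where the parameter savings come from.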

Paper Structure

This paper contains 27 sections, 1 theorem, 11 equations, 8 figures, and 7 tables.

Key Result

Theorem 1

Consider a pre-trained language model with an adapter $\beta_{r}YX$, where $Y \in \mathbb{R}^{d \times r}$ is initialized to $0_{d \times r}$, and the entries of $X \in \mathbb{R}^{r \times d^\prime}$ are i.i.d. Gaussian random variables with zero mean and variance $\sigma_{X}^{2}$. Here, $d$ and $d^\prime$ denote the dimensions of the frozen weight matrix that the adapter augments.
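A hedged note on where this setup leads, drawn from the rank-stabilization analysis that rsLoRA builds on rather than quoted from the paper: the gradients of the adapter remain bounded as the rank varies exactly when the scaling factor decays as the inverse square root of the rank,

$$\beta_{r} \in \Theta\!\left(\frac{1}{\sqrt{r}}\right), \qquad \text{e.g. } \beta_{r} = \frac{\alpha}{\sqrt{r}},$$

in contrast to standard LoRA's $\beta_{r} = \alpha/r$, under which update magnitudes shrink as $r$ grows. This is the "provable gradient stability at low ranks" referenced in the abstract.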

Figures (8)

  • Figure 1: One-for-All Framework: A parameter-efficient LLM unifying long-term, few-shot, zero-shot, short-term forecasting, classification, and anomaly detection. By integrating Gaussian rank-stabilized LoRA (rsLoRA) into positional embeddings and output layers while freezing the pre-trained LLM weights, we minimize trainable parameters without compromising stability.
  • Figure 2: Comparison of model efficiency across different forecast horizons. (a) Trainable parameters (in millions, log scale) and (b) Model size (in MiB, log scale) for various time-series forecasting approaches. The One-for-All model (highlighted in red) demonstrates consistently low parameter counts and memory footprint across all horizons, while other methods (dashed lines) show varying computational requirements. Notably, large pre-trained models (e.g., TIME-LLM and GPT4TS) exhibit significantly higher resource demands, particularly at longer horizons (336 and 720). The Avg column represents the average across all horizons, further emphasizing the efficiency advantages of the proposed approach.
  • Figure 3: Trade-offs Between Model Accuracy, Efficiency, and Scalability for Long-Term Forecasting. The average MSE (y-axis) measures prediction accuracy (lower is better), while the number of trainable parameters (x-axis) reflects model efficiency (leftward is better). The bubble sizes represent memory usage (smaller is better), highlighting scalability constraints.
  • Figure 4: Trade-offs Between Model Accuracy, Efficiency, and Scalability for Few-Shot Forecasting with 10% of the training data. The average MSE (y-axis) measures prediction accuracy (lower is better), while the number of trainable parameters (x-axis) reflects model efficiency (leftward is better). The bubble sizes represent memory usage (smaller is better), highlighting scalability constraints.
  • Figure 5: Accuracy (%) of One-for-All (red) versus baseline models (blue) across six time-series datasets. The proposed model performs robustly, matching or exceeding specialized approaches on most tasks, particularly Japanese Vowels (98%) and SCP1 (93%).
  • ...and 3 more figures

Theorems & Definitions (2)

  • Definition 1
  • Theorem 1