Test-Time Efficient Pretrained Model Portfolios for Time Series Forecasting

Mert Kayaalp, Caner Turkmen, Oleksandr Shchur, Pedro Mercado, Abdul Fatir Ansari, Michael Bohlke-Schneider, Bernie Wang

TL;DR

This paper tackles the high computational cost of large pretrained time-series forecasters by proposing Chroma, a portfolio of small, pretrained forecasters formed by post-training a generalist into frequency- and domain-specialists. At test time, predictions are produced via model selection or greedy ensemble methods, achieving competitive accuracy with far fewer active parameters than monolithic models and with favorable compute-efficiency trade-offs versus test-time fine-tuning. The approach demonstrates that specialist portfolios, aided by post-training, can scale similarly to generalist models and yield interpretability through activation patterns across specialists. Overall, Chroma offers a modular, scalable framework for test-time efficient forecasting that could extend to other domains, and a practical alternative to best-of-$N$ sampling from a single base model.

Abstract

Is bigger always better for time series foundation models? With this question in mind, we explore an alternative to training a single, large monolithic model: building a portfolio of smaller, pretrained forecasting models. By applying ensembling or model selection over these portfolios, we achieve competitive performance on large-scale benchmarks using far fewer parameters. We explore strategies for designing such portfolios and find that collections of specialist models consistently outperform portfolios of independently trained generalists. Remarkably, we demonstrate that post-training a base model is a compute-effective approach for creating sufficiently diverse specialists, and provide evidence that ensembling and model selection are more compute-efficient than test-time fine-tuning.
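The two test-time strategies named above, model selection and greedy ensembling, can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a Caruana-style greedy ensemble selection with replacement, uses mean absolute error on point forecasts as a stand-in validation loss, and the function names (`select_best`, `greedy_ensemble`) and the array layout are hypothetical.

```python
import numpy as np

def mae(y_true, y_pred):
    # Stand-in validation loss (assumption); the paper reports quantile-based losses.
    return float(np.mean(np.abs(y_true - y_pred)))

def select_best(val_preds, y_val, loss=mae):
    # Model selection: index of the portfolio member with the lowest validation loss.
    return int(np.argmin([loss(y_val, p) for p in val_preds]))

def greedy_ensemble(val_preds, y_val, n_rounds=10, loss=mae):
    # Greedy selection with replacement: in each round, add the model whose
    # inclusion minimizes the validation loss of the averaged forecast.
    val_preds = np.asarray(val_preds, dtype=float)  # shape: (n_models, n_points)
    n_models = val_preds.shape[0]
    counts = np.zeros(n_models, dtype=int)
    running_sum = np.zeros_like(val_preds[0])
    for step in range(1, n_rounds + 1):
        cand = [loss(y_val, (running_sum + val_preds[m]) / step)
                for m in range(n_models)]
        best = int(np.argmin(cand))
        counts[best] += 1
        running_sum += val_preds[best]
    # Ensemble weights are the normalized selection counts.
    return counts / counts.sum()
```

Under these assumptions, the returned weights would be applied to each model's test-time forecast; models with zero weight need not be run at all, which is where the savings in active parameters would come from.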

Paper Structure

This paper contains 32 sections, 4 equations, 19 figures, 14 tables.

Figures (19)

  • Figure 1: A diagram depicting the approaches explored in this work. (a) Typical setup for pretraining, where a single large model is trained on a large corpus composed of many different datasets. At test time, the model can be invoked zero-shot or after test-time fine-tuning to obtain forecasts. (b) Building a model portfolio by training an array of smaller models, where each specialist is trained on a single smaller corpus representing a different modality or domain. At test time, these models are combined by fitting an ensemble or by selecting the best model, and the selected combination (or single model) is used for forecasting. (c) Specialists can also be trained in two stages, by post-training the generalist model. We demonstrate that this approach for inducing diversity into the model portfolio results in comparable accuracy, while reducing the training-time compute by an order of magnitude. The approach also matches the forecast accuracy of (a) while using far fewer total parameters actively for inference.
  • Figure 2: Results on Chronos Benchmark II (BM2) and GIFT-Eval. Results reported are for probabilistic forecasting, with weighted quantile losses (WQL) scaled relative to the Seasonal Naive model and aggregated across all datasets using the geometric mean. For Chroma, Best refers to performing model selection on a validation set, while Ensemble refers to the ensemble selection algorithm.
  • Figure 3: Scaling behavior of Chroma portfolios. Results are presented for performing model selection from a Chroma portfolio of domain or frequency specialists. Each individual run is for an independently trained generalist in the portfolio, with four trials reported per experiment setting. $\alpha$ refers to the slope of the scaling fit, and asterisks denote statistical significance of the fitted coefficient at the 5% level (p < 0.05). Reported results are aggregated across BM2.
  • Figure 4: Efficient frontier of Chroma portfolios, compared to using a single generalist model without any test-time computation (i.e., zero-shot, $\textcolor{gray}{\times}$) and fine-tuning a generalist model ($\textcolor{chromablue}{\blacksquare}$). Test-time computation is measured as the estimated FLOPs for performing model fitting, fine-tuning, or model selection in addition to a single forward pass for inference, per time series in the test set. Reported results are aggregated across BM2.
  • Figure 5: Distribution of ensemble weights for 4m specialist portfolios across distinct tasks, grouped with respect to their domain or frequency information.
  • ...and 14 more figures
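
The aggregation described in the Figure 2 caption (per-dataset WQL scaled by the Seasonal Naive score, then combined with a geometric mean) can be sketched as follows. This is an illustrative sketch, not the benchmark code: the WQL definition used here (pinball loss averaged over quantile levels and normalized by the sum of absolute targets) is a commonly used form and is an assumption, as are the function names and array shapes.

```python
import numpy as np

def weighted_quantile_loss(y, q_preds, quantiles):
    # y: targets, shape (T,); q_preds: quantile forecasts, shape (Q, T).
    # Assumed definition: 2 * pinball loss per quantile level, averaged over
    # levels, normalized by sum(|y|).
    y = np.asarray(y, dtype=float)
    q_preds = np.asarray(q_preds, dtype=float)
    total = 0.0
    for q, pred in zip(quantiles, q_preds):
        diff = y - pred
        total += 2.0 * np.sum(np.maximum(q * diff, (q - 1.0) * diff))
    return float(total / (len(quantiles) * np.sum(np.abs(y))))

def geometric_mean_relative(model_scores, baseline_scores):
    # Scale each dataset's score by the baseline score for that dataset,
    # then aggregate the ratios with a geometric mean.
    ratios = np.asarray(model_scores, dtype=float) / np.asarray(baseline_scores, dtype=float)
    return float(np.exp(np.mean(np.log(ratios))))
```

With this aggregation, a value below 1 would indicate that the model beats the Seasonal Naive baseline on average across datasets; the geometric mean keeps a single dataset with a large ratio from dominating the aggregate.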