A Language Model-Guided Framework for Mining Time Series with Distributional Shifts

Haibei Zhu, Yousef El-Laham, Elizabeth Fons, Svitlana Vyetrenko

TL;DR

The paper addresses two obstacles to robust time series analysis: data that reflect distributional shifts are scarce, and privacy constraints limit access to what exists. It introduces a language model-guided dataset mining pipeline that uses LLMs and public data APIs to identify, collect, prune, and augment external time series whose statistical properties resemble those of the primary data. A data-pruning step based on offline change point detection, followed by three augmentation techniques, yields a diverse cross-domain collection that is used to fine-tune forecasting foundation models such as Lag-Llama and Chronos. Experiments show that fine-tuning on the collected data gives competitive or improved forecasting performance under distributional shifts, which indicates the practical value of augmenting scarce data.
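The pruning step can be illustrated with a minimal sketch. The summary does not say which offline change point detector the authors use, so the code below swaps in a simple stand-in: a single mean-shift split chosen by least squares. The function names (`best_mean_shift_split`, `has_shift`, `prune`) and the `min_size` and `threshold` parameters are hypothetical and exist only for this illustration.

```python
import numpy as np


def best_mean_shift_split(x, min_size=5):
    """Find the single split that most reduces the squared error around the mean.

    Returns (index, gain); index is None when the series is too short.
    This is an illustrative stand-in, not the detector used in the paper.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    if n < 2 * min_size:
        return None, 0.0
    total_cost = np.sum((x - x.mean()) ** 2)
    best_idx, best_gain = None, 0.0
    for k in range(min_size, n - min_size + 1):
        left, right = x[:k], x[k:]
        cost = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        gain = total_cost - cost
        if gain > best_gain:
            best_idx, best_gain = k, gain
    return best_idx, best_gain


def has_shift(x, min_size=5, threshold=0.5):
    """True if the best split explains at least `threshold` of the total variation."""
    x = np.asarray(x, dtype=float)
    total = np.sum((x - x.mean()) ** 2)
    if total == 0:
        return False  # a constant series has no shift
    _, gain = best_mean_shift_split(x, min_size)
    return gain / total >= threshold


def prune(candidates, **kwargs):
    """Keep only the candidate series (name -> values) that show a mean shift."""
    return {name: s for name, s in candidates.items() if has_shift(s, **kwargs)}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    candidates = {
        "shifted": np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)]),
        "flat": rng.normal(0, 1, 200),
    }
    print(sorted(prune(candidates)))
```

A variance-explained threshold is only one possible criterion. A penalized multi-change-point method would be the more usual choice when a series may contain several shifts.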

Abstract

Effective use of time series data is often constrained by the scarcity of data that reflect complex dynamics, especially under distributional shifts. Existing datasets may not cover the full range of statistical properties required for robust and comprehensive analysis, and privacy concerns can further limit their accessibility in domains such as finance and healthcare. This paper presents an approach that uses large language models and data source interfaces to explore and collect time series datasets. Although obtained from external sources, the collected data share critical statistical properties with the primary time series datasets, which makes it possible to model and adapt to various scenarios. The method increases the amount of available data when the original data are limited or lack essential properties. The results suggest that collected datasets can effectively supplement existing datasets, especially where changes in data distribution are involved. We demonstrate the effectiveness of the collected datasets through practical examples and show how time series forecasting foundation models fine-tuned on these datasets achieve comparable performance to those models without fine-tuning.

Paper Structure

This paper contains 14 sections, 2 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The overview of the dataset mining pipeline.
  • Figure 2: Initial prompt.
  • Figure 3: Example prompt to generate queries within a certain API.
  • Figure 4: Example queries generated by the LLM.