Measuring Pre-training Data Quality without Labels for Time Series Foundation Models

Songkang Wen; Vasilii Feofanov; Jianfeng Zhang

TL;DR

The paper addresses unsupervised evaluation of pre-training data quality for time series foundation models (TSFMs) by introducing contrastive accuracy (CA), $A_{\text{con}}^{(\text{X}')}(\text{X}_0)$, which captures how well the learned embedding space distributes representations in line with the contrastive learning objective. Using a ViT-based TSFM with overlapping CNN patches, the authors show that CA correlates with downstream accuracy on unseen tasks, enabling data-driven dataset selection without labeled evaluation. They also demonstrate that changes in CA from adding new data ($\Delta A_{\text{con}}$) predict the corresponding performance gains ($\Delta \mathcal{P}$), which supports using CA to guide data curation for pre-training. This work offers a practical, unsupervised criterion for improving the generalization of time series foundation models and raises questions about augmentation design for time series contrastive learning.

Abstract

Recently, there has been growing interest in time series foundation models that generalize across different downstream tasks. A key to strong foundation models is a diverse pre-training dataset, which is particularly challenging to collect for time series classification. In this work, we explore the performance of a contrastive-learning-based foundation model as a function of the data used for pre-training. We introduce contrastive accuracy, a new measure to evaluate the quality of the representation space learned by the foundation model. Our experiments reveal a positive correlation between the proposed measure and the accuracy of the model on a collection of downstream tasks. This suggests that contrastive accuracy can serve as a criterion to search for time series datasets that can enhance pre-training and thereby improve the foundation model's generalization.
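To make the idea of a label-free measure concrete, the following is a minimal sketch of one plausible way to compute a contrastive accuracy from embeddings. The text above does not give the exact definition, so everything here is an assumption for illustration: the function name `contrastive_accuracy`, the use of cosine similarity, and the convention that row i of `z_a` and row i of `z_b` are embeddings of two augmented views of the same series. The score is the fraction of samples whose most similar embedding in the other view is their own positive pair.

```python
import numpy as np


def contrastive_accuracy(z_a: np.ndarray, z_b: np.ndarray) -> float:
    """Hypothetical label-free score (an assumed definition, not the paper's).

    z_a, z_b: arrays of shape (n_samples, dim). Row i of z_a and row i of z_b
    are assumed to be embeddings of two augmented views of the same series.
    Returns the fraction of rows in z_a whose nearest row in z_b (by cosine
    similarity) is their own positive pair.
    """
    # L2-normalize so the dot product equals cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # sim[i, j] is the similarity between view A of sample i and view B of sample j.
    sim = z_a @ z_b.T

    # A sample counts as correct when its own pair is the most similar candidate.
    predicted = sim.argmax(axis=1)
    return float((predicted == np.arange(len(z_a))).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(128, 32))
    # Two noisy "views" of the same underlying embeddings (synthetic stand-ins).
    view_a = base + 0.1 * rng.normal(size=base.shape)
    view_b = base + 0.1 * rng.normal(size=base.shape)
    print(contrastive_accuracy(view_a, view_b))
```

Under this assumed definition, the score needs no class labels: it only requires passing two augmented views of unlabeled series through the encoder, which is what makes such a measure usable on unseen datasets.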
