LiPCoT: Linear Predictive Coding based Tokenizer for Self-supervised Learning of Time Series Data via Language Models

Md Fahim Anjum

TL;DR

LiPCoT introduces a linear predictive coding–based tokenizer to convert time series into discrete tokens suitable for NLP models like BERT, enabling self-supervised learning on time-series data. By constructing latent spaces from LPC, cepstral, and dominant spectral representations and clustering them into a token vocabulary, LiPCoT facilitates effective self-supervised pretraining and downstream tasks. In a Parkinson's disease EEG classification study, LiPCoT–BERT outperformed four CNN-based baselines across precision, recall, accuracy, AUC, and F1, demonstrating strong benefits even on relatively small datasets and highlighting invariance to sampling rate and length. The work points to a promising direction for scalable time-series foundation models and self-supervised learning through LPC-based tokenization.

Abstract

Language models have achieved remarkable success in various natural language processing tasks. However, their application to time series data, a crucial component in many domains, remains limited. This paper proposes LiPCoT (Linear Predictive Coding based Tokenizer for time series), a novel tokenizer that encodes time series data into a sequence of tokens, enabling self-supervised learning of time series using existing language model architectures such as BERT. Unlike traditional time series tokenizers that rely heavily on a CNN encoder for time series feature generation, LiPCoT employs stochastic modeling through linear predictive coding to create a latent space for time series, providing a compact yet rich representation of the inherent stochastic nature of the data. Furthermore, LiPCoT is computationally efficient and can effectively handle time series data with varying sampling rates and lengths, overcoming common limitations of existing time series tokenizers. In this proof-of-concept work, we demonstrate the effectiveness of LiPCoT in classifying Parkinson's disease (PD) using an EEG dataset from 46 participants. In particular, we utilize LiPCoT to encode EEG data into a small vocabulary of tokens and then use BERT for self-supervised learning and the downstream task of PD classification. We benchmark our approach against several state-of-the-art CNN-based deep learning architectures for PD detection. Our results reveal that BERT models utilizing self-supervised learning outperformed the best-performing existing method by 7.1% in precision, 2.3% in recall, 5.5% in accuracy, 4% in AUC, and 5% in F1-score, highlighting the potential for self-supervised learning even on small datasets. Our work will inform future foundational models for time series, particularly for self-supervised learning.
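
The pipeline described in the abstract (segment the series, fit an LPC model per segment, cluster the coefficient vectors, and use cluster indices as tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the autocorrelation (Yule-Walker) solver, the plain k-means clustering, and the names `lpc_coefficients`, `segment_features`, `kmeans`, and `tokenize` are all assumptions made for illustration. It also omits the frequency warping and the cepstral and dominant-spectral latent spaces mentioned elsewhere in the paper.

```python
import numpy as np


def lpc_coefficients(segment, order):
    """Autocorrelation-method LPC (assumed solver): x[t] ~ sum_k a[k-1] * x[t-k]."""
    x = np.asarray(segment, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Biased autocorrelation estimates for lags 0..order.
    r = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(order + 1)])
    # Toeplitz matrix of lags 0..order-1, with a tiny ridge for numerical safety.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R = R + 1e-8 * np.eye(order)
    return np.linalg.solve(R, r[1:])


def segment_features(signal, seg_len, order):
    """One LPC coefficient vector per non-overlapping segment."""
    n_seg = len(signal) // seg_len
    return np.array([
        lpc_coefficients(signal[i * seg_len:(i + 1) * seg_len], order)
        for i in range(n_seg)
    ])


def kmeans(points, k, iters=50, seed=0):
    """Plain k-means (an assumed stand-in for the clustering step)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers


def tokenize(signal, seg_len, order, centers):
    """Map each segment to the index of its nearest cluster center."""
    feats = segment_features(signal, seg_len, order)
    dists = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)
```

In this sketch, the cluster centers would be fitted once on training segments with `kmeans(segment_features(...), k)`, and `tokenize` would then turn any series into integer token ids that a model such as BERT can consume. Because the features are fixed-length coefficient vectors, the segment length does not change the feature dimension.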

Paper Structure

This paper contains 39 sections, 36 equations, 5 figures, 3 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Overview of LiPCoT and its application for PD classification via BERT.
  • Figure 2: Tokenization of time series data via LiPCoT: One-minute data from the validation set from a single EEG channel before (top) and after (bottom) tokenization. Each color represents a unique token. LPC coefficients were utilized for latent space construction with order $L=16$, warping coefficient $\lambda=0.2$.
  • Figure 3: Spectral density of tokenized data segments: Power spectral density of 5-second data segments in a single EEG channel from the validation set colored by their respective LiPCoT tokens. LiPCoT with LPC coefficients, order $L=16$, warping coefficient $\lambda=0.2$.
  • Figure 4: Tokenized data segments: Representative data segments colored by their respective LiPCoT tokens. Each plot shows a single 5-second time series segment. Data from a single EEG channel in the validation set. LiPCoT with LPC coefficients, order $L=16$, warping coefficient $\lambda=0.2$.
  • Figure 5: Illustration of poles from a second order LPC model in z-domain (left) and the corresponding power spectrum with one dominant frequency peak (right).
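
The relationship shown in Figure 5 between LPC poles and a dominant spectral peak can be sketched numerically. The following is an illustrative assumption rather than the paper's code: the function names, the sign convention $x[t] \approx \sum_k a_k x[t-k]$, and the example sampling rate are all hypothetical. The pole angle is converted to a frequency in Hz, and the all-pole spectrum $1/|A(e^{j\omega})|^2$ peaks near that frequency when the pole lies close to the unit circle.

```python
import numpy as np


def lpc_poles(a):
    """Poles of 1 / (1 - sum_k a[k-1] z^-k): roots of z^p - a1 z^(p-1) - ... - ap."""
    a = np.asarray(a, dtype=float)
    return np.roots(np.concatenate(([1.0], -a)))


def dominant_frequency(a, fs):
    """Frequency (Hz) of the upper-half-plane pole with the largest magnitude."""
    poles = lpc_poles(a)
    upper = poles[poles.imag > 0]  # keep one pole per complex-conjugate pair
    if upper.size == 0:
        return None  # only real poles: no oscillatory component
    p = upper[np.argmax(np.abs(upper))]
    return float(np.angle(p) * fs / (2 * np.pi))


def lpc_power_spectrum(a, fs, n_freqs=512):
    """All-pole spectrum 1 / |A(e^{jw})|^2 on a grid from 0 to fs/2 (unit gain assumed)."""
    a = np.asarray(a, dtype=float)
    freqs = np.linspace(0.0, fs / 2, n_freqs)
    omega = 2 * np.pi * freqs / fs
    k = np.arange(1, len(a) + 1)
    A = 1.0 - np.exp(-1j * np.outer(omega, k)) @ a
    return freqs, 1.0 / np.abs(A) ** 2


# Hypothetical second-order model: conjugate poles at radius 0.95, angle set for 10 Hz.
fs = 250.0  # example sampling rate, chosen only for illustration
theta = 2 * np.pi * 10.0 / fs
a2_model = np.array([2 * 0.95 * np.cos(theta), -0.95 ** 2])
```

Calling `dominant_frequency(a2_model, fs)` recovers the pole angle as a frequency, while `lpc_power_spectrum(a2_model, fs)` gives the curve corresponding to the right-hand panel of the figure. The spectral maximum sits slightly below the pole frequency, because the peak location of a second-order model also depends on the pole radius.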