Time Series Language Model for Descriptive Caption Generation

Mohamed Trabelsi; Aidan Boyd; Jin Cao; Huseyin Uzunalioglu

TL;DR

This work addresses the challenge of describing time series patterns in natural language by introducing TSLM, a multi-modal encoder–decoder that fuses textual and time-series embeddings. TSLM leverages phase-tagged text, a time-series CNN encoder, and a cross-modal reprogramming mechanism to align modalities, while training is aided by synthetic in-context data generation and a cross-modal denoising step that filters noisy captions. Empirical results on STOCK and SYNTH show that TSLM outperforms baselines across multiple metrics, with ablations confirming the value of joint representations and data denoising. The approach demonstrates a practical path for integrating time series analysis with large language models, producing descriptive captions that can be further summarized by LLMs for end-user interpretation.

Abstract

The automatic generation of representative natural language descriptions for observable patterns in time series data enhances interpretability, simplifies analysis, and increases the cross-domain utility of temporal data. While pre-trained foundation models have made considerable progress in natural language processing (NLP) and computer vision (CV), their application to time series analysis has been hindered by data scarcity. Although several large language model (LLM)-based methods have been proposed for time series forecasting, time series captioning is under-explored in the context of LLMs. In this paper, we introduce TSLM, a novel time series language model designed specifically for time series captioning. TSLM operates as an encoder-decoder model, leveraging both text prompts and time series data representations to capture subtle temporal patterns across multiple phases and generate precise textual descriptions of time series inputs. TSLM addresses the data scarcity problem in time series captioning by first generating synthetic data via in-context prompting, and then denoising the generated data via a novel cross-modal dense retrieval scoring applied to time series-caption pairs. Experimental findings on various time series captioning datasets demonstrate that TSLM outperforms existing state-of-the-art approaches from multiple data modalities by a significant margin.

Paper Structure (39 sections, 14 equations, 7 figures, 3 tables)

Introduction
Related work
Time Series Captioning
Multi-modal Models
Problem Statement
TSLM: Multi-modal Encoder
Time Series Representations
Textual Representations
Embedding Representations
Joint Representations
Multi-Modal Encoder Architecture
TSLM: Training with Denoised Generated Data
In-Context Prompting Data Generation
Time Series 1D CNN Autoencoder
Denoise Generated Data via Cross-Modal Dense Retrieval Scoring
...and 24 more sections

Figures (7)

Figure 1: Overview of training TSLM, which is composed of four key steps: (a) in-context prompting data generation; (b) time series 1D CNN autoencoder; (c) denoising the generated data via cross-modal dense retrieval scoring; and (d) training TSLM with the denoised generated data.
Figure 2: The overview of generating a descriptive caption. The joint representation of the unseen time series is extracted by combining the textual and embedding representations, then TSLM generates $K$ captions that are summarized using LLaMA2-13B-Chat to obtain the final descriptive caption.
Figure 3: STOCK data and generated captions. TSLM generates 3 captions that are summarized using LLaMA2-13B-Chat to obtain a descriptive caption. TSLM generates precise and accurate captions that describe multiple phases and patterns of the time series.
Figure 4: Examples of generated time series-caption pairs with their predicted denoising scores computed by the cross-modal dense retrieval model. The first row shows noisy generated samples that receive a low score from the denoising model; these samples are removed to denoise the generated data. The second row shows high-quality generated samples that receive a high score from the denoising model; these samples are kept.
Figure 5: Distribution of the denoising scores of the generated data. The distribution is approximated with a normal distribution with mean $\mu = 3.37$ and standard deviation $\sigma = 2.44$.
...and 2 more figures
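
The score-based filtering described in the captions of Figures 4 and 5 can be illustrated with a minimal sketch. Only the mean (3.37) and standard deviation (2.44) come from the Figure 5 caption; the cutoff value, the pair identifiers, the example scores, and the function names below are hypothetical, and the paper's actual threshold and scoring model are not stated in this overview.

```python
from statistics import NormalDist

# Values reported in the Figure 5 caption.
SCORE_MEAN = 3.37
SCORE_STD = 2.44


def expected_keep_fraction(threshold, mean=SCORE_MEAN, std=SCORE_STD):
    """Fraction of generated pairs expected to score at or above `threshold`
    under the normal approximation of the score distribution."""
    return 1.0 - NormalDist(mu=mean, sigma=std).cdf(threshold)


def filter_by_score(scored_pairs, threshold):
    """Return the ids of (pair_id, score) entries whose score reaches the threshold."""
    return [pair_id for pair_id, score in scored_pairs if score >= threshold]


# Hypothetical denoising scores for four generated time series-caption pairs.
scored_pairs = [("pair_a", -0.8), ("pair_b", 3.9), ("pair_c", 6.2), ("pair_d", 1.5)]

# Assumption: use the distribution mean as the cutoff, purely for illustration.
threshold = SCORE_MEAN

kept = filter_by_score(scored_pairs, threshold)
print(kept)
print(round(expected_keep_fraction(threshold), 2))
```

With the mean as the cutoff, roughly half of the generated pairs would be expected to survive under the normal approximation; a stricter cutoff trades dataset size for caption quality.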
