On the Feasibility of Vision-Language Models for Time-Series Classification

Vinay Prithyani, Mohsin Mohammed, Richa Gadgil, Ricardo Buitrago, Vinija Jain, Aman Chadha

TL;DR

VLMs produce competitive results after two or fewer epochs of fine-tuning. The paper also develops a novel approach that incorporates graphical data representations as images in conjunction with numerical data.

Abstract

We build upon time-series classification by leveraging the capabilities of Vision Language Models (VLMs). We find that VLMs produce competitive results after two or fewer epochs of fine-tuning. We develop a novel approach that incorporates graphical data representations as images in conjunction with numerical data. This approach is rooted in the hypothesis that graphical representations can provide additional contextual information that numerical data alone may not capture. Additionally, providing a graphical representation can circumvent issues such as the limited context length of LLMs. To further advance this work, we implemented a scalable end-to-end pipeline for training on different scenarios, allowing us to isolate the most effective strategies for transferring learning capabilities from LLMs to Time Series Classification (TSC) tasks. Our approach works with both univariate and multivariate time-series data. In addition, we conduct extensive and practical experiments to show how this approach works for time-series classification and generative labels.

Paper Structure

This paper contains 29 sections, 12 figures, 7 tables, 2 algorithms.

Figures (12)

  • Figure 1: Pipeline Overview: Workflow for running multiple scenarios from the UCR Time Series Archive
  • Figure 2: System Architecture: The top left depicts Time Series images fed into the Vision Encoder, combined with a tokenized Text Prompt, and both processed through a shared language model (Vicuna) to perform Time Series Classification
  • Figure 3: Univariate Baseline Prompt Template Example: Prompt consists of a question followed by a comma-separated list of the signal
  • Figure 4: Univariate With Stats Prompt Template Example: Prompt consists of a question followed by a comma-separated list of the signal and ending with basic statistics calculated on the signal
  • Figure 5: Multivariate Baseline Prompt Template Example: Prompt consists of a question followed by a comma-separated list of the signal across each dimension
  • ...and 7 more figures
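
The prompt templates described in the captions for Figures 3 to 5 can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the question wording, the `Dimension i:` and `Statistics:` labels, the number formatting, and the particular statistics (mean, standard deviation, min, max) are all assumptions, since the captions only say that the prompt is a question followed by a comma-separated list of the signal, optionally ending with basic statistics.

```python
from statistics import mean, pstdev
from typing import Sequence

# Hypothetical question wording; the captions do not give the exact text.
QUESTION = "Which class does this time series belong to?"


def format_signal(signal: Sequence[float], precision: int = 3) -> str:
    """Render a signal as a comma-separated list of fixed-precision values."""
    return ", ".join(f"{x:.{precision}f}" for x in signal)


def univariate_baseline_prompt(signal: Sequence[float], question: str = QUESTION) -> str:
    """Question followed by the comma-separated signal (cf. Figure 3)."""
    return f"{question}\n{format_signal(signal)}"


def univariate_stats_prompt(signal: Sequence[float], question: str = QUESTION) -> str:
    """Baseline prompt ending with basic statistics of the signal (cf. Figure 4).

    The choice of statistics here is an assumption for illustration.
    """
    stats = (
        f"mean={mean(signal):.3f}, std={pstdev(signal):.3f}, "
        f"min={min(signal):.3f}, max={max(signal):.3f}"
    )
    return f"{univariate_baseline_prompt(signal, question)}\nStatistics: {stats}"


def multivariate_baseline_prompt(
    dimensions: Sequence[Sequence[float]], question: str = QUESTION
) -> str:
    """Question followed by one comma-separated list per dimension (cf. Figure 5)."""
    lines = [
        f"Dimension {i}: {format_signal(dim)}"
        for i, dim in enumerate(dimensions, start=1)
    ]
    return "\n".join([question, *lines])
```

For example, `univariate_stats_prompt([1.0, 3.0])` would yield the question, the line `1.000, 3.000`, and a final statistics line. Long signals rendered this way consume many tokens, which is the context-length concern the abstract raises as motivation for supplying an image of the series as well.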