Can VLM Pseudo-Labels Train a Time-Series QA Model That Outperforms the VLM?

Takuya Fujimura, Kota Dohi, Natsuo Yamashita, Yohei Kawaguchi

TL;DR

The paper tackles the scarcity of labeled data for domain-independent time-series question answering (TSQA) by using pseudo labels that a vision-language model (VLM) generates from time-series plots. It proposes a training pipeline in which a time-series encoder produces embeddings that, together with the VLM-derived pseudo labels, train a TSQA model; the VLM (GPT-4o) serves as the 'teacher' and a frozen LLM (Mistral-7B-Instruct-v0.1) acts as the text backbone. Empirically, the TSQA model trained on pseudo labels (TSQA-PL) closely approaches the ground-truth upper bound and, in some settings, surpasses the VLM itself, demonstrating robustness to noisy supervision when a large amount of unlabeled data is available. The study also analyzes data requirements and mislabeling patterns, highlighting the dependence on VLM quality and suggesting that larger data scales can mitigate noise-induced degradation, which offers a scalable path for domain-independent TSQA. The approach has practical implications for building TSQA systems without expensive labeled data, while acknowledging limitations tied to VLM performance on complex signals.

Abstract

Time-series question answering (TSQA) tasks face significant challenges due to the lack of labeled data. Alternatively, with recent advancements in large-scale models, vision-language models (VLMs) have demonstrated the potential to analyze time-series signals in a zero-shot manner. In this paper, we propose a training approach that uses pseudo labels generated by a VLM. Although VLMs can produce incorrect labels, TSQA models can still be effectively trained based on the property that deep neural networks are inherently robust to such noisy labels. Our experimental results demonstrate that TSQA models are not only successfully trained with pseudo labels, but also surpass the performance of the VLM itself by leveraging a large amount of unlabeled data.
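
The claim that a student trained on noisy pseudo labels can outperform its noisy teacher can be illustrated with a toy simulation. The sketch below is not the paper's method: it assumes two one-dimensional Gaussian classes, a hypothetical teacher that flips each label independently with a fixed probability (symmetric noise, which real VLM errors need not follow), and a nearest-centroid student. The function name and all parameters are illustrative assumptions.

```python
import random


def simulate_pseudo_label_training(n_train=2000, n_test=1000,
                                   flip_prob=0.3, seed=0):
    """Toy illustration only: noisy teacher vs. student trained on its labels."""
    rng = random.Random(seed)

    def sample(n):
        # Class 0 is centred at -1.0, class 1 at +1.0 (assumed toy data).
        data = []
        for _ in range(n):
            y = rng.randint(0, 1)
            x = rng.gauss(1.0 if y == 1 else -1.0, 0.5)
            data.append((x, y))
        return data

    train = sample(n_train)
    test = sample(n_test)

    # Hypothetical teacher: flips the true label with probability flip_prob.
    pseudo = [1 - y if rng.random() < flip_prob else y for _, y in train]
    teacher_acc = sum(p == y for p, (_, y) in zip(pseudo, train)) / n_train

    # Student: nearest-centroid classifier fitted on pseudo labels only.
    centroids = {}
    for k in (0, 1):
        xs = [x for (x, _), p in zip(train, pseudo) if p == k]
        centroids[k] = sum(xs) / len(xs)

    def predict(x):
        return 0 if abs(x - centroids[0]) <= abs(x - centroids[1]) else 1

    # The student is scored against the true labels of held-out data.
    student_acc = sum(predict(x) == y for x, y in test) / n_test
    return teacher_acc, student_acc


if __name__ == "__main__":
    teacher_acc, student_acc = simulate_pseudo_label_training()
    print(f"teacher accuracy on train: {teacher_acc:.3f}")
    print(f"student accuracy on test:  {student_acc:.3f}")
```

Under symmetric noise the flipped labels pull both centroids toward the middle by the same amount, so the decision threshold stays near the true boundary. That is one simple reason a student can recover accuracy the teacher lacks; it does not carry over automatically to structured, class-dependent errors.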

Paper Structure

This paper contains 10 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Overview of the proposed method.
  • Figure 2: Confusion matrices. (a) Results of GPT-4o on the training set (i.e., pseudo labels used in TSQA-PL) and (b) results of TSQA-PL on the test set, averaged over five trials. The colormap shows the recall score for each class.
  • Figure 3: Evaluation results as the correct-label ratio varies. Black circles represent individual scores from each of the five trials, red circles represent the mean score, and red error bars represent the standard deviation.
  • Figure 4: Evaluation results as the number of training samples varies. Black circles represent individual scores from each of the five trials, red circles represent the mean score, and red error bars represent the standard deviation.
  • Figure 5: Visualization of the embedding space for the cubic function signals. (a) Embeddings of the training data annotated with pseudo labels generated by GPT-4o, and (b) embeddings of the test data annotated with predictions from TSQA-PL. We excluded two samples misclassified as exponential growth in the training set and one sample misclassified as convex in the test set. All figures share the same axes. These figures show results from a single trial out of five trials.
  • ...and 1 more figure
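
The Figure 2 caption states that the colormap shows the recall score for each class. As a reminder of how that quantity is read off a confusion matrix, here is a minimal sketch. It assumes rows index the true class and columns the predicted class, and the counts in the usage example are made up for illustration; they are not results from the paper. The function name `per_class_recall` is hypothetical.

```python
def per_class_recall(confusion):
    """Return recall per class for a square confusion matrix.

    Assumes confusion[i][j] counts samples of true class i predicted as
    class j, so recall for class i is the diagonal entry divided by the
    row sum. Classes with no samples get None.
    """
    recalls = []
    for i, row in enumerate(confusion):
        total = sum(row)
        recalls.append(row[i] / total if total > 0 else None)
    return recalls


if __name__ == "__main__":
    # Made-up counts for two classes, purely illustrative.
    example = [
        [8, 2],   # true class 0: 8 correct, 2 predicted as class 1
        [1, 9],   # true class 1: 1 predicted as class 0, 9 correct
    ]
    print(per_class_recall(example))
```

Dividing each row by its sum is what makes the teacher's and the student's matrices comparable even when class sizes differ.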