Can VLM Pseudo-Labels Train a Time-Series QA Model That Outperforms the VLM?
Takuya Fujimura, Kota Dohi, Natsuo Yamashita, Yohei Kawaguchi
TL;DR
The paper tackles the scarcity of labeled data for domain-independent time-series question answering (TSQA) by using pseudo labels that a vision-language model (VLM) generates from time-series plots. In the proposed training pipeline, a time-series encoder produces embeddings, and these embeddings are paired with the VLM-derived pseudo labels to train a TSQA model; the VLM (GPT-4o) serves as the 'teacher' and a frozen LLM (Mistral-7B-Instruct-v0.1) acts as the text backbone. Empirically, the TSQA model trained on pseudo labels (TSQA-PL) closely approaches the upper bound obtained with ground-truth labels and, in some settings, surpasses the VLM itself, which indicates robustness to noisy supervision when a large amount of unlabeled data is available. The study also analyzes data requirements and mislabeling patterns, highlighting the dependence on VLM quality and suggesting that larger data scales can mitigate noise-induced degradation. The approach offers a scalable path to building domain-independent TSQA systems without expensive labeled data, with the limitation that its quality is tied to VLM performance on complex signals.
Abstract
Time-series question answering (TSQA) tasks face significant challenges due to the lack of labeled data. Meanwhile, with recent advances in large-scale models, vision-language models (VLMs) have demonstrated the potential to analyze time-series signals in a zero-shot manner. In this paper, we propose a training approach that uses pseudo labels generated by a VLM. Although VLMs can produce incorrect labels, TSQA models can still be trained effectively because deep neural networks are inherently robust to such noisy labels. Our experimental results demonstrate that TSQA models are not only successfully trained with pseudo labels, but also surpass the performance of the VLM itself by leveraging a large amount of unlabeled data.
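
The claim that a student can outperform the teacher that labeled its training data may seem counterintuitive, so a toy illustration may help. The sketch below is not the paper's method: it replaces the VLM with a simulated teacher that flips the true label at random (symmetric noise, an assumption), and it replaces the TSQA model with a nearest-centroid classifier on synthetic 1-D features. All names (`simulate`, `teacher_acc`, and so on) are hypothetical. It only shows that, when label errors are not systematically aligned with the input, averaging over many noisy labels can recover a decision rule that is more accurate than the labels themselves.

```python
import random

def simulate(n_train=5000, n_test=2000, teacher_acc=0.7, seed=0):
    rng = random.Random(seed)

    def sample(n):
        # Two classes with 1-D features centred at -2 and +2 (unit variance).
        data = []
        for _ in range(n):
            y = rng.randint(0, 1)
            x = rng.gauss(2.0 if y == 1 else -2.0, 1.0)
            data.append((x, y))
        return data

    train, test = sample(n_train), sample(n_test)

    # Simulated teacher: returns the true label with probability teacher_acc,
    # otherwise the wrong one (symmetric noise; an assumption of this toy).
    pseudo = [y if rng.random() < teacher_acc else 1 - y for _, y in train]

    # Student: nearest-centroid classifier fitted on pseudo labels only.
    sums, counts = [0.0, 0.0], [0, 0]
    for (x, _), p in zip(train, pseudo):
        sums[p] += x
        counts[p] += 1
    c0, c1 = sums[0] / counts[0], sums[1] / counts[1]

    def student(x):
        return 0 if abs(x - c0) <= abs(x - c1) else 1

    # Teacher accuracy is measured on the training set it labeled;
    # student accuracy is measured on held-out data with true labels.
    teacher_train_acc = sum(p == y for (_, y), p in zip(train, pseudo)) / n_train
    student_test_acc = sum(student(x) == y for x, y in test) / n_test
    return teacher_train_acc, student_test_acc

teacher_acc, student_acc = simulate()
print(f"teacher (pseudo-label) accuracy: {teacher_acc:.3f}")
print(f"student accuracy on held-out data: {student_acc:.3f}")
```

In this toy setting the noisy labels pull both class centroids toward each other but leave the midpoint between them roughly unchanged, so the student's decision boundary stays close to the ideal one. Real VLM errors need not be symmetric or input-independent, which is one reason the mislabeling patterns of the actual teacher matter.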
