Plots Unlock Time-Series Understanding in Multimodal Models

Mayank Daswani, Mathias M. J. Bellaiche, Marc Wilson, Desislav Ivanov, Mikhail Papkov, Eva Schnider, Jing Tang, Kay Lamerigts, Gabriela Botea, Michael A. Sanchez, Yojan Patel, Shruthi Prabhakara, Shravya Shetty, Umesh Telang

TL;DR

The paper shows that multimodal foundation models can better understand time-series data by interpreting plots through their vision encoders rather than processing raw numeric sequences as text, all without additional training. Using structured prompting and a carefully designed methodology, the authors demonstrate substantial performance gains across synthetic tasks and real-world IMU-based tasks (fall detection, activity recognition, readiness) and reveal meaningful token-cost savings. The approach relies on four methodological pillars—structured prompts, diverse base models, floating-point representations, and robust statistics—yielding a generalizable time-series encoder via plotting. This plot-centric strategy offers a practical, training-free path to leverage existing multimodal models for time-series reasoning in diverse domains, with further work focusing on plotting optimization and explainability.

Abstract

While multimodal foundation models can now natively work with data beyond text, they remain underutilized in analyzing the considerable amounts of multi-dimensional time-series data in fields like healthcare, finance, and social sciences, representing a missed opportunity for richer, data-driven insights. This paper proposes a simple but effective method that leverages the existing vision encoders of these models to "see" time-series data via plots, avoiding the need for additional, potentially costly, model training. Our empirical evaluations show that this approach outperforms providing the raw time-series data as text, with the additional benefit that visual time-series representations demonstrate up to a 90% reduction in model API costs. We validate our hypothesis through synthetic data tasks of increasing complexity, progressing from simple functional form identification on clean data, to extracting trends from noisy scatter plots. To demonstrate generalizability from synthetic tasks with clear reasoning steps to more complex, real-world scenarios, we apply our approach to consumer health tasks - specifically fall detection, activity recognition, and readiness assessment - which involve heterogeneous, noisy data and multi-step reasoning. The overall success in plot performance over text performance (up to a 120% performance increase on zero-shot synthetic tasks, and up to a 150% performance increase on real-world tasks), across both GPT and Gemini model families, highlights our approach's potential for making the best use of the native capabilities of foundation models.
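
To make the cost argument concrete, the following is a minimal sketch (not the authors' code) of why a plot can be cheaper than text: the text baseline grows with the length of the series, while an image is assumed here to cost a flat number of tokens. The helper names, the four-characters-per-token heuristic, and the `IMAGE_TOKENS` value are all illustrative assumptions; real costs depend on the model's tokenizer and image pricing.

```python
import math


def serialize_series(values, decimals=2):
    """Render a series as comma-separated fixed-precision text (the text baseline)."""
    return ", ".join(f"{v:.{decimals}f}" for v in values)


def estimate_text_tokens(text, chars_per_token=4.0):
    """Rough token estimate. chars_per_token is an assumed heuristic, not a real tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def relative_saving(text_tokens, image_tokens):
    """Fraction of tokens saved by sending one image instead of the serialized text."""
    return 1.0 - image_tokens / text_tokens


# Synthetic example: 1000 samples of a sine wave.
values = [math.sin(2 * math.pi * t / 50) for t in range(1000)]
text = serialize_series(values)
text_tokens = estimate_text_tokens(text)

# Hypothetical flat per-image cost; an assumption for illustration only.
IMAGE_TOKENS = 258

saving = relative_saving(text_tokens, IMAGE_TOKENS)
print(f"text tokens (estimated): {text_tokens}")
print(f"image tokens (assumed):  {IMAGE_TOKENS}")
print(f"estimated saving:        {saving:.0%}")
```

Under these assumptions the saving increases with series length, since only the text side scales with the number of samples. The figure printed here is an estimate from made-up parameters and should not be read as a reproduction of the paper's reported cost reduction.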

Paper Structure (37 sections, 20 figures, 37 tables)

Figures (20)

  • Figure 1: Zero-shot synthetic data results showing plot- and text-based accuracy (MAE for the cluster counting task) distributions for all models, with horizontal lines representing random performance. The results generally show better performance for plots compared to text across models.
  • Figure 2: Quadratic derivative identification results show that zero-shot plots outperform text, except for the outlier GPT4o model. With few-shot prompting, more examples generally improve the gain.
  • Figure 3: Results of the fall detection task show consistently better plot performance across models and numbers of few-shot examples, with plot performance generally increasing with the number of shots. The top plot models have 10-shot (sensitivity, specificity) as follows: Gemini Pro 1.5 - (0.84, 0.95) and GPT4o - (0.92, 0.81), compared to the state-of-the-art task-specific support-vector machine model reported by aziz2017comparison, which achieves (0.96, 0.96) (see Supplementary Table \ref{tab:falldetection_sota} for more details).
  • Figure 4: Results of the activity recognition task for all models across few-shot numbers (where context length allowed), showing overall improved performance for plots. The performance of the state-of-the-art deep-learning model reported by deeptransHHAR2022 is included for reference.
  • Figure 5: Results of the readiness task for Gemini models only (as the dataset cannot be sent to other models), demonstrating approximate parity between the text and plot approaches.
  • ...and 15 more figures