
TimeOmni-VL: Unified Models for Time Series Understanding and Generation

Tong Guan, Sheng Pan, Johan Barthelemy, Zhao Li, Yujun Cai, Cesare Alippi, Ming Jin, Shirui Pan

TL;DR

This work proposes TimeOmni-VL, the first vision-centric framework that unifies time series understanding and generation through two key innovations: (1) fidelity-preserving bidirectional mapping between time series and images (Bi-TSI), which advances Time Series-to-Image and Image-to-Time Series conversions to ensure near-lossless transformations, and (2) understanding-guided generation, which uses time series understanding as an explicit control signal for high-fidelity generation.

Abstract

Recent time series modeling faces a sharp divide between numerical generation and semantic understanding, with research showing that generation models often rely on superficial pattern matching, while understanding-oriented models struggle with high-fidelity numerical output. Although unified multimodal models (UMMs) have bridged this gap in vision, their potential for time series remains untapped. We propose TimeOmni-VL, the first vision-centric framework that unifies time series understanding and generation through two key innovations: (1) Fidelity-preserving bidirectional mapping between time series and images (Bi-TSI), which advances Time Series-to-Image (TS2I) and Image-to-Time Series (I2TS) conversions to ensure near-lossless transformations. (2) Understanding-guided generation. We introduce TSUMM-Suite, a novel dataset consisting of six understanding tasks rooted in time series analytics that are coupled with two generation tasks. With a calibrated Chain-of-Thought, TimeOmni-VL is the first to leverage time series understanding as an explicit control signal for high-fidelity generation. Experiments confirm that this unified approach significantly improves both semantic understanding and numerical precision, establishing a new frontier for multimodal time series modeling.

Paper Structure (43 sections, 14 equations, 8 figures, 18 tables)

Figures (8)

  • Figure 1: Comparison of architectures for (a) time series understanding models that produce textual answers only, (b) time series generation models that output time series only, and (c) unified time series understanding and generation models that support both answering queries and generating time series.
  • Figure 2: Overview of the TimeOmni-VL framework. The input time series is first converted into a TS-image $I$ by the (a) TS2I Converter. For understanding tasks, the understanding model directly produces CoT $R$ and the final answer. For generation tasks, the understanding model first generates CoT $R$ as conditions for the generation module to generate the target image $I_{\mathrm{tgt}}$, which is then converted back to a time series by the (b) I2TS Converter. Detailed pipelines of the TS2I and I2TS converters are shown on the right.
  • Figure 3: Illustration of improvements in Bi-TSI. (a) Robust fidelity normalization enables lossless rendering of high-dynamic-range time series by keeping values within the valid pixel range, whereas the baseline in VisionTS++ can overflow this range and fail to represent spikes. (b) Encoding capacity control prevents implicit downsampling when encoding high-dimensional time series, ensuring that the resulting TS-image remains information-preserving, whereas the baseline suffers information loss.
  • Figure 4: Illustrative examples of the proposed TSUMM-Suite, consisting of six time series understanding tasks and two generation tasks. The generation CoT is directly derived from the understanding tasks, explicitly bridging the two task families.
  • Figure 5: Performance on TS-image understanding tasks.
  • ...and 3 more figures
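
To make the round trip described in the Figure 2 and Figure 3 captions more concrete, here is a minimal sketch of a time series to image conversion and its inverse. It is not the paper's Bi-TSI implementation: the function names `ts_to_image` and `image_to_ts`, the `meta` dictionary, the median/IQR scaling used as a stand-in for "robust fidelity normalization", and the period-folding layout are all assumptions made for illustration. The only property it tries to show is the one the captions name: a spike stays inside the valid pixel range instead of overflowing it, and the stored scaling parameters let the series be recovered.

```python
import numpy as np


def ts_to_image(series, period):
    """Fold a 1-D series into a (period, n_cycles) image with pixels in [0, 1].

    Hypothetical helper: the layout and the normalization are assumptions,
    not the converter described in the paper.
    """
    x = np.asarray(series, dtype=np.float64)
    n_cycles = -(-len(x) // period)  # ceiling division

    # Robust center and scale (median and interquartile range), so a single
    # spike does not dominate the statistics.
    center = float(np.median(x))
    q75, q25 = np.percentile(x, [75, 25])
    scale = float(q75 - q25)
    if scale == 0.0:
        scale = 1.0
    z = (x - center) / scale

    # Rescale by the largest magnitude so every value, spikes included,
    # lands inside [0, 1] rather than being clipped.
    bound = float(np.max(np.abs(z)))
    if bound == 0.0:
        bound = 1.0
    pixels = 0.5 + 0.5 * z / bound

    # Pad the last cycle with mid-gray, then place one cycle per column.
    padded = np.full(n_cycles * period, 0.5)
    padded[: len(x)] = pixels
    image = padded.reshape(n_cycles, period).T

    meta = {"center": center, "scale": scale, "bound": bound, "length": len(x)}
    return image, meta


def image_to_ts(image, meta):
    """Invert ts_to_image using the stored normalization parameters."""
    flat = image.T.reshape(-1)[: meta["length"]]
    z = (flat - 0.5) * 2.0 * meta["bound"]
    return z * meta["scale"] + meta["center"]


# Usage: a sine wave with one large spike survives the round trip.
series = np.sin(np.linspace(0.0, 8.0 * np.pi, 50))
series[17] = 40.0
image, meta = ts_to_image(series, period=12)
recovered = image_to_ts(image, meta)
```

In this sketch the image is kept in floating point, so the round trip is exact up to rounding error. Quantizing the pixels to 8 bits, as a real image pipeline would, introduces error proportional to `bound * scale`, which is one reason the choice of normalization matters for fidelity.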