Vision-Enhanced Time Series Forecasting via Latent Diffusion Models

Weilin Ruan, Siru Zhong, Haomin Wen, Yuxuan Liang

TL;DR

This work tackles uncertainty-aware long-horizon time series forecasting by reframing forecasting as image reconstruction in a latent diffusion framework. It introduces LDM4TS, which transforms time series into multi-view visual representations (SEG, GAF, RP), encodes these via a frozen latent diffusion model guided by cross-modal conditioning (frequency and text signals), and fuses global and local temporal cues through a temporal projection module. The approach yields state-of-the-art results across diverse datasets, including strong performance in long-term, few-shot, and zero-shot settings, with substantial MSE improvements over competitive baselines. By leveraging vision encoders and probabilistic diffusion in latent space, LDM4TS provides robust uncertainty quantification and scalable forecasting, offering a new pathway for cross-modal temporal modeling in real-world applications.

Abstract

Diffusion models have recently emerged as powerful frameworks for generating high-quality images. While recent studies have explored their application to time series forecasting, these approaches face significant challenges in cross-modal modeling and transforming visual information effectively to capture temporal patterns. In this paper, we propose LDM4TS, a novel framework that leverages the powerful image reconstruction capabilities of latent diffusion models for vision-enhanced time series forecasting. Instead of introducing external visual data, we are the first to use complementary transformation techniques to convert time series into multi-view visual representations, allowing the model to exploit the rich feature extraction capabilities of the pre-trained vision encoder. Subsequently, these representations are reconstructed using a latent diffusion model with a cross-modal conditioning mechanism as well as a fusion module. Experimental results demonstrate that LDM4TS outperforms various specialized forecasting models for time series forecasting tasks.

Paper Structure

This paper contains 71 sections, 52 equations, 11 figures, 13 tables, 1 algorithm.

Figures (11)

  • Figure 1: Comparison between traditional TSF methods and the vision-enhanced approach, highlighting how our method leverages multi-view visual representations to enhance TSF.
  • Figure 2: The framework of our proposed LDM4TS. Time series data is first transformed into complementary visual representations (SEG: Segmentation, GAF: Gramian Angular Field, RP: Recurrence Plot) that encode structural temporal patterns. A conditional latent diffusion model then reconstructs the masked images through iterative denoising guided by cross-modal conditioning (FC: frequency conditioning, TC: textual conditioning). Finally, the reconstructed images are mapped back to time series space with explicit temporal dependencies and implicit patterns.
  • Figure 3: The forward process of LDM4TS.
  • Figure 4: Visualization results of long-term forecasting by the LDM4TS model on all datasets under the input-96-predict-96 setting. Detailed comparisons with baselines on the ETTh1 dataset are provided in the Appendix.
  • Figure 5: Visualization of the multi-view visual representations after transformation. Each row shows one transformation: Segmentation (SEG) in the top row, Gramian Angular Field (GAF) in the middle row, and Recurrence Plot (RP) in the bottom row. More detailed results are provided in the Appendix.
  • ...and 6 more figures
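
The three transformations named in the Figure 2 caption (SEG, GAF, RP) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the min-max scaling ranges, the recurrence threshold `eps`, and the row-per-period layout of the segmentation image are all assumptions chosen for illustration, and the paper's own definitions may differ in detail.

```python
import numpy as np


def minmax_scale(x, lo=-1.0, hi=1.0):
    """Rescale a 1-D series to [lo, hi]; a constant series maps to zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.zeros_like(x)
    return lo + (x - x.min()) * (hi - lo) / span


def gramian_angular_field(x):
    """Gramian Angular (summation) Field: entry (i, j) is cos(phi_i + phi_j),
    where phi = arccos of the series rescaled to [-1, 1]."""
    s = np.clip(minmax_scale(x, -1.0, 1.0), -1.0, 1.0)
    phi = np.arccos(s)
    return np.cos(phi[:, None] + phi[None, :])


def recurrence_plot(x, eps=0.1):
    """Binary recurrence plot: entry (i, j) is 1 when |x_i - x_j| <= eps.
    The threshold eps is an illustrative assumption."""
    s = minmax_scale(x, 0.0, 1.0)
    dist = np.abs(s[:, None] - s[None, :])
    return (dist <= eps).astype(float)


def segmentation_image(x, period):
    """Segmentation view (assumed layout): stack consecutive windows of
    length `period` as rows, dropping any incomplete trailing window."""
    x = np.asarray(x, dtype=float)
    n = (len(x) // period) * period
    return x[:n].reshape(-1, period)
```

For a series of length L, the GAF and RP views are L × L images that encode pairwise relations between time steps, while the segmentation view is a (L // period) × period image that places corresponding positions of successive periods in the same column. In practice such images would typically be resized and stacked as channels before being passed to a vision encoder; that step is omitted here.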