UniDiff: A Unified Diffusion Framework for Multimodal Time Series Forecasting

Da Zhang, Bingyu Li, Zhuyuan Zhao, Junyu Gao, Feiping Nie, Xuelong Li

TL;DR

UniDiff tackles multimodal time series forecasting with a unified diffusion framework that fuses numeric sequences, textual context, and timestamps through a parallel cross-attention fusion module. It tokenizes the time series into patches, encodes each of the three modalities with its own encoder before a shared fusion module and prediction head, and adds a decoupled classifier-free guidance mechanism that controls the strength of text and timestamp conditioning independently. Experiments on real-world benchmark datasets spanning eight domains report state-of-the-art performance, with strong robustness and favorable efficiency relative to prior unimodal and multimodal baselines. The work highlights the value of flexible cross-modal integration and controllable conditioning for context-aware, probabilistic forecasting in diverse domains.
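
The decoupled guidance idea can be illustrated with a small sketch. The summary does not give the paper's guidance formula, so the additive combination below is an assumption borrowed from common multi-condition classifier-free guidance practice; the function name `decoupled_guidance` and the weights `w_text` and `w_time` are hypothetical.

```python
import numpy as np

def decoupled_guidance(eps_uncond, eps_text, eps_time, w_text, w_time):
    """Combine three noise predictions using one guidance weight per condition.

    ASSUMPTION: additive multi-condition guidance, not necessarily the
    formula used in the paper.
      eps_uncond: prediction with text and timestamps both dropped
      eps_text:   prediction conditioned on text only
      eps_time:   prediction conditioned on timestamps only
    """
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    text_direction = np.asarray(eps_text, dtype=float) - eps_uncond
    time_direction = np.asarray(eps_time, dtype=float) - eps_uncond
    # Each conditioning source pushes the estimate along its own direction,
    # scaled by its own weight, so the two strengths can be tuned separately.
    return eps_uncond + w_text * text_direction + w_time * time_direction

# Toy predictions for a two-step forecast window.
eps_uncond = np.array([0.0, 0.0])
eps_text = np.array([1.0, 2.0])
eps_time = np.array([-1.0, 0.5])

guided = decoupled_guidance(eps_uncond, eps_text, eps_time, w_text=2.0, w_time=1.0)
text_only = decoupled_guidance(eps_uncond, eps_text, eps_time, w_text=1.0, w_time=0.0)
```

In this form, setting `w_time=0.0` removes the timestamp signal without touching the text guidance, which is one way to obtain independent control at inference time.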

Abstract

As multimodal data proliferates across diverse real-world applications, leveraging heterogeneous information such as texts and timestamps for accurate time series forecasting (TSF) has become a critical challenge. While diffusion models demonstrate exceptional performance in generation tasks, their application to TSF remains largely confined to modeling single-modality numerical sequences, overlooking the abundant cross-modal signals inherent in complex heterogeneous data. To address this gap, we propose UniDiff, a unified diffusion framework for multimodal time series forecasting. To process the numerical sequence, our framework first tokenizes the time series into patches, preserving local temporal dynamics by mapping each patch to an embedding space via a lightweight MLP. At its core lies a unified and parallel fusion module, where a single cross-attention mechanism adaptively weighs and integrates structural information from timestamps and semantic context from texts in one step, enabling a flexible and efficient interplay between modalities. Furthermore, we introduce a novel classifier-free guidance mechanism designed for multi-source conditioning, allowing for decoupled control over the guidance strength of textual and temporal information during inference, which significantly enhances model robustness. Extensive experiments on real-world benchmark datasets across eight domains demonstrate that the proposed UniDiff model achieves state-of-the-art performance.
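
As a rough illustration of the patch tokenization step, the sketch below splits a one-dimensional series into patches and maps each patch to an embedding with a small two-layer MLP. The patch length, stride, layer sizes, ReLU activation, and the helper names `patchify` and `mlp_embed` are all assumptions made for illustration; the abstract only states that patches are embedded by a lightweight MLP.

```python
import numpy as np

def patchify(series, patch_len, stride):
    """Split a 1-D series into patches of length patch_len, one every stride steps."""
    series = np.asarray(series, dtype=float)
    if series.ndim != 1 or len(series) < patch_len:
        raise ValueError("series must be 1-D and at least patch_len long")
    starts = range(0, len(series) - patch_len + 1, stride)
    return np.stack([series[s:s + patch_len] for s in starts])

def mlp_embed(patches, w1, b1, w2, b2):
    """Apply a two-layer MLP with ReLU to every patch independently."""
    hidden = np.maximum(patches @ w1 + b1, 0.0)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
patch_len, d_hidden, d_model = 4, 8, 6   # hypothetical sizes

series = np.arange(16, dtype=float)
patches = patchify(series, patch_len=patch_len, stride=4)   # shape (4, 4)

# Randomly initialised weights stand in for trained parameters.
w1 = rng.normal(size=(patch_len, d_hidden))
b1 = np.zeros(d_hidden)
w2 = rng.normal(size=(d_hidden, d_model))
b2 = np.zeros(d_model)

tokens = mlp_embed(patches, w1, b1, w2, b2)                 # shape (4, 6)
```

Because each patch covers several consecutive time steps, every token carries local temporal context, and the sequence passed to later layers is shorter than the raw series.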

Paper Structure

This paper contains 27 sections, 21 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: The landscape of Multimodal TSF. Our proposed UniDiff framework processes core numerical series alongside rich contextual information from texts and structural cues from timestamps to generate robust predictions.
  • Figure 2: Different architectural paradigms for TSF. (a) Single-modality models rely solely on historical numerical data, ignoring rich contextual information. (b) Simplistic fusion models use rudimentary techniques like concatenation, failing to capture complex inter-modal dynamics. (c) LLM-centric conversion models transform all data into text, risking the loss of precision and inherent characteristics of numerical data. (d) Our proposed UniDiff framework employs a unified and parallel fusion mechanism to adaptively integrate numerical, temporal, and textual information, overcoming the limitations of prior architectures.
  • Figure 3: The overall architecture of the proposed UniDiff framework. The model processes three input modalities: Text ($\mathcal{D}$), the time series Sequence ($\mathcal{X}$), and Timestamps ($\mathcal{T}$). These inputs are first encoded into feature representations ($d$, $z$, and $t$). The core of the model consists of: (a) a Multimodal Fusion module that uses a unified cross-attention mechanism to integrate textual and temporal context into the sequence representation, and (b) a Prediction Head that generates the denoised estimate for the current step. This estimate conditions the iterative Diffusive Process to produce the final forecast.
  • Figure 4: Visualization of cross-attention weights in the Unified Fusion Module. The heatmaps contrast the attention focus of the full UniDiff model against ablated variants. (a, b) Event-Driven Scenario: The full model (a) distinctively attends to textual cues to capture unexpected flu outbreaks, whereas the 'w/o Text' variant (b) lacks this focal point. (c, d) Pattern-Driven Scenario: The full model (c) leverages timestamp embeddings to capture seasonal periodicity, while the 'w/o Timestamp' variant (d) exhibits a scattered attention pattern, failing to recognize the temporal structure.
  • Figure 5: Efficiency comparison, evaluated with a prediction horizon of 336 and a batch size of 1. Left: MSE and computational complexity (MACs), with bubble size indicating parameter count. Right: MAE and inference speed (ms), with bubble size representing peak GPU memory usage.
  • ...and 4 more figures
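
The unified fusion described in the Figure 3 caption can be sketched as a single cross-attention call in which the sequence representation $z$ supplies the queries and the text features $d$ and timestamp features $t$ are concatenated to form one key/value set. This is one plausible reading of "unified cross-attention", not the authors' confirmed implementation: the single head, the residual connection, the shared feature dimension, and the name `unified_cross_attention` are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def unified_cross_attention(z, d, t, wq, wk, wv):
    """Fuse text (d) and timestamp (t) features into sequence tokens (z).

    ASSUMPTION: text and timestamp tokens share one key/value set, so a
    single softmax decides how much each sequence token draws from each.
    """
    context = np.concatenate([d, t], axis=0)      # (n_text + n_time, dim)
    q = z @ wq
    k = context @ wk
    v = context @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])       # scaled dot-product scores
    weights = softmax(scores, axis=-1)            # each row sums to 1
    fused = z + weights @ v                       # residual connection (assumed)
    return fused, weights

rng = np.random.default_rng(0)
dim = 8                                           # hypothetical feature size
z = rng.normal(size=(5, dim))                     # 5 sequence patch tokens
d = rng.normal(size=(3, dim))                     # 3 text tokens
t = rng.normal(size=(4, dim))                     # 4 timestamp tokens
wq, wk, wv = (rng.normal(size=(dim, dim)) for _ in range(3))

fused, weights = unified_cross_attention(z, d, t, wq, wk, wv)
text_share = weights[:, :3].sum(axis=1)           # attention mass placed on text
```

Since text and timestamp tokens compete inside the same softmax, `text_share` gives a per-token measure of how strongly the fusion relies on text versus timestamps, which is the kind of quantity the heatmaps in Figure 4 visualize.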