UniCast: A Unified Framework for Instance-Conditioned Multimodal Time-Series Forecasting

Sehyuk Park, Soyeon Caren Han, Eduard Hovy

TL;DR

UniCast is a parameter-efficient multimodal framework that extends TSFMs through instance-conditioned prompting and dynamic modality routing. It consistently outperforms all existing TSFM baselines, demonstrating that instance-conditioned multimodal control is critical for next-generation time series forecasting.

Abstract

Time series forecasting underpins applications in finance, healthcare, and environmental monitoring. Despite the success of Time Series Foundation Models (TSFMs), existing approaches operate in a unimodal setting and rely on static prompts or fixed fusion schemes, limiting their ability to exploit multimodal context and adapt to instance-level variation. We propose UniCast, a parameter-efficient multimodal framework that extends TSFMs through instance-conditioned prompting and dynamic modality routing. UniCast infers a conditional prompt from time series, vision, and text inputs via a Transformer-based contextual distiller, enabling input-specific adaptation without updating the forecasting backbone. To regulate how auxiliary modalities influence predictions, UniCast employs Modality Routing, a cross-attention mechanism that estimates modality relevance given the current temporal state and selectively amplifies informative signals while suppressing noise. Integrated with a frozen TSFM via soft prompt tuning, UniCast preserves foundation-level generalization while enabling effective multimodal control. Extensive experiments across diverse forecasting benchmarks show that UniCast consistently outperforms all existing TSFM baselines, demonstrating that instance-conditioned multimodal control is critical for next-generation time series forecasting.
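The routing idea described above, weighting auxiliary modalities by their relevance to the current temporal state, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `route_modalities`, the toy embeddings, and the use of plain scaled dot-product attention without learned query/key/value projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def route_modalities(temporal_state, modality_embeddings):
    """Hypothetical routing step (an assumption, not the paper's code).

    Each auxiliary modality is scored by its scaled dot-product similarity
    to the temporal state (the query). The softmax of those scores gives
    per-modality relevance weights, and the weighted sum of the embeddings
    is the context that would be injected into the forecaster.
    """
    names = list(modality_embeddings)
    dim = len(temporal_state)
    scale = math.sqrt(dim)
    scores = [dot(temporal_state, modality_embeddings[n]) / scale for n in names]
    weights = softmax(scores)
    context = [
        sum(w * modality_embeddings[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), context


# Toy inputs: the vision embedding is aligned with the temporal state,
# while the text embedding is not, so vision should receive more weight.
temporal_state = [1.0, 0.0, 1.0, 0.0]
modalities = {
    "vision": [0.9, 0.1, 0.8, 0.0],
    "text": [-0.5, 0.7, -0.6, 0.2],
}
weights, context = route_modalities(temporal_state, modalities)
print(weights)
```

In this toy setting the relevance weights sum to one and the better-aligned modality dominates the injected context. A real cross-attention layer would additionally learn projection matrices and operate over token sequences rather than single vectors.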

Paper Structure

This paper contains 32 sections, 1 equation, 8 figures, 10 tables, 1 algorithm.

Figures (8)

  • Figure 1: Radar plot showing per-dataset model rankings (1 = best) comparing UniCast (Chronos) with strong time-series foundation model baselines under zero-shot (ZS) and full fine-tuned (FT) settings. UniCast consistently attains top/near-top ranks across heterogeneous domains, highlighting the benefit of instance-conditioned multimodal control over static or unimodal forecasting approaches.
  • Figure 2: UniCast formulates multimodal time-series forecasting as an instance-level modality credit assignment problem. (Left) Time series, vision, and text inputs are encoded by frozen pretrained backbones to preserve foundation-level generalization. (Middle) Conditional Prompting distills cross-modal context to generate instance-specific soft prompts, enabling adaptive conditioning under temporal non-stationarity. (Right) The Modality Routing mechanism performs iterative, input-dependent credit assignment via cross-attention, selectively injecting informative auxiliary signals into the TSFM while suppressing irrelevant or noisy modalities during forecasting.
  • Figure 3: Ablation study on Conditional Prompting (CP) and Modality Routing (MR). We compare zero-shot (ZS), full fine-tuning (FT), and UniCast variants with different design components: without CP and MR (None), with CP only, with MR only, and with both components enabled (All). Both CP and MR individually improve performance over static or heuristic baselines, while combining both components consistently yields the lowest forecasting error, highlighting their complementary roles in instance-conditioned multimodal adaptation. Exact values can be found in Table \ref{tab:cpmrablation} in Appendix \ref{app:cpmrablation}.
  • Figure 4: Attention heatmaps of the BLIP vision backbone. Each subfigure visualizes attention patterns at progressively deeper layers, from early (Layer 1 to 3) to later (Layer 10 to 12) stages of processing, illustrating how visual relevance evolves during prediction.
  • Figure 5: Forecasting examples. Subfigures show results on five datasets: (a) Aus-Elec, (b) Hospital, (c) ETTh1, (d) ETTh2, and (e) ETTm2. Each subfigure compares the forecasts produced by a fully fine-tuned time-series model (FT) and UniCast (Ours) against the Ground Truth. The black curve denotes the input context, while colored curves correspond to predictions over the forecast horizon.
  • ...and 3 more figures