CALF: Aligning LLMs for Time Series Forecasting via Cross-modal Fine-Tuning

Peiyuan Liu, Hang Guo, Tao Dai, Naiqi Li, Jigang Bao, Xudong Ren, Yong Jiang, Shu-Tao Xia

TL;DR

CALF tackles distribution mismatch between textual and time-series modalities in LLM-based MTSF by introducing a two-branch cross-modal fine-tuning framework. It couples a temporal target branch with a textual source branch via the Cross-Modal Match Module, Feature Regularization Loss, and Output Consistency Loss, and uses parameter-efficient training with LoRA and PCA-based synonym clustering to align inputs and outputs. The method achieves state-of-the-art results across long-term and short-term forecasting, as well as few-shot and zero-shot settings, with lower computational cost than prior LLM-based approaches. This work demonstrates that multi-level cross-modal alignment can unlock robust generalization and practical applicability of LLMs in data-scarce time-series forecasting scenarios.

Abstract

Deep learning (e.g., Transformer) has been widely and successfully used in multivariate time series forecasting (MTSF). Unlike existing methods that train models on a single modality of time series input, MTSF methods based on large language models (LLMs) with cross-modal text and time series input have recently shown great superiority, especially with limited temporal data. However, current LLM-based MTSF methods usually focus on adapting and fine-tuning LLMs while neglecting the distribution discrepancy between textual and temporal input tokens, which leads to sub-optimal performance. To address this issue, we propose a novel Cross-Modal LLM Fine-Tuning (CALF) framework for MTSF that reduces the distribution discrepancy between textual and temporal data. The framework mainly consists of a temporal target branch with temporal input and a textual source branch with aligned textual input. To reduce the distribution discrepancy, we develop the cross-modal match module to first align cross-modal input distributions. Additionally, to minimize the modality distribution gap in both feature and output spaces, a feature regularization loss is developed to align the intermediate features between the two branches for better weight updates, while an output consistency loss is introduced so that the output representations of both branches correspond effectively. Thanks to the modality alignment, CALF establishes state-of-the-art performance for both long-term and short-term forecasting tasks with low computational complexity, and exhibits favorable few-shot and zero-shot abilities similar to those of LLMs. Code is available at https://github.com/Hank0626/LLaTA.
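The abstract names a feature regularization loss and an output consistency loss but does not state their form. The following is a minimal NumPy sketch of how such a two-branch objective could be combined with a supervised forecasting loss. The function names, the L1 distance, the MSE supervised term, and the weights `lam_feat` and `lam_out` are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def feature_regularization_loss(text_feats, time_feats, layer_weights=None):
    """Weighted mean of per-layer L1 distances between the branches' hidden states.

    text_feats / time_feats: lists of arrays, one per layer, with matching shapes.
    The L1 distance and the per-layer weighting are assumptions for this sketch.
    """
    if layer_weights is None:
        layer_weights = np.ones(len(text_feats))
    per_layer = [
        np.mean(np.abs(f_text - f_time))
        for f_text, f_time in zip(text_feats, time_feats)
    ]
    return float(np.dot(layer_weights, per_layer) / np.sum(layer_weights))

def output_consistency_loss(text_out, time_out):
    """L1 distance between the final outputs of the two branches (assumed form)."""
    return float(np.mean(np.abs(text_out - time_out)))

def supervised_loss(time_out, target):
    """Mean squared error between the temporal branch forecast and the target."""
    return float(np.mean((time_out - target) ** 2))

def total_loss(text_feats, time_feats, text_out, time_out, target,
               lam_feat=0.01, lam_out=1.0):
    """Supervised loss plus the two alignment terms; the weights are hypothetical."""
    return (
        supervised_loss(time_out, target)
        + lam_feat * feature_regularization_loss(text_feats, time_feats)
        + lam_out * output_consistency_loss(text_out, time_out)
    )

# Toy usage with random stand-ins for hidden states and forecasts.
rng = np.random.default_rng(0)
text_feats = [rng.normal(size=(4, 8)) for _ in range(3)]
time_feats = [rng.normal(size=(4, 8)) for _ in range(3)]
text_out = rng.normal(size=(4, 12))
time_out = rng.normal(size=(4, 12))
target = rng.normal(size=(4, 12))
loss = total_loss(text_feats, time_feats, text_out, time_out, target)
```

In this sketch both alignment terms vanish when the two branches produce identical features and outputs, so the objective reduces to the supervised forecasting loss alone.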

Paper Structure

This paper contains 30 sections, 6 equations, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: t-SNE visualization of the LLM's pre-trained word token embeddings together with the penultimate-layer hidden features of GPT4TS [zhou2023onefitsall], TimeLLM [jin2023timellm], TEST [sun2024test], and our method on the ETTh2 dataset. Current LLM-based methods either use linear layers to project time series to the LLM's feature dimension [zhou2023onefitsall] or employ cross-attention and contrastive learning techniques [jin2023timellm, sun2024test]; both address only the input side and overlook alignment in the deeper layers. Our CALF achieves better alignment through multi-level cross-modal fine-tuning.
  • Figure 2: An overview of the proposed cross-modal fine-tuning framework. Above is the Textual Source Branch, and below is the Temporal Target Branch. To bridge the modality gap, the framework employs three cross-modal fine-tuning techniques: ① Cross-Modal Match Module, ② Feature Regularization Loss, and ③ Output Consistency Loss.
  • Figure 3: Cross-attention maps from the Cross-Modal Match Module for ETTh1 (left) and ETTh2 (right). Each row represents a time series instance, while columns correspond to selected words, including both time-related terms (e.g., trend, seasonality) and general terms (e.g., echo, key). Each cell indicates the relevance of the respective channel to the selected word.
  • Figure 4: Ablation on the reduced dimension $d$ of PCA on the (a) ETTh1 and (b) ETTh2 datasets.
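
The captions for Figures 3 and 4 mention cross-attention maps from the Cross-Modal Match Module and a PCA dimension $d$. A minimal NumPy sketch of one way these pieces could fit together is given below: PCA compresses the vocabulary axis of the word embedding matrix into $d$ principal word embeddings, and time series tokens then attend to them. Applying PCA along the vocabulary axis, using a single attention head, and the random projection matrices `w_q`, `w_k`, `w_v` are all assumptions made for illustration, not details confirmed by the text above.

```python
import numpy as np

def principal_word_embeddings(word_emb, d):
    """Compress (vocab_size, hidden) word embeddings to (d, hidden) via PCA.

    Assumption: PCA is applied along the vocabulary axis, so each of the d
    outputs is a linear combination of the (centered) word embeddings.
    """
    x = word_emb.T                            # (hidden, vocab_size)
    x = x - x.mean(axis=0, keepdims=True)     # center each vocabulary column
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return (x @ vt[:d].T).T                   # (d, hidden)

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_match(time_tokens, principal_words, w_q, w_k, w_v):
    """Single-head cross-attention: time tokens are queries, words are keys/values."""
    q = time_tokens @ w_q                     # (n_tokens, hidden)
    k = principal_words @ w_k                 # (d, hidden)
    v = principal_words @ w_v                 # (d, hidden)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n_tokens, d)
    return attn @ v, attn

# Toy usage: 200 "words", hidden size 16, d = 8, 7 time series tokens.
rng = np.random.default_rng(0)
vocab_size, hidden, d, n_tokens = 200, 16, 8, 7
word_emb = rng.normal(size=(vocab_size, hidden))
time_tokens = rng.normal(size=(n_tokens, hidden))
w_q, w_k, w_v = (rng.normal(size=(hidden, hidden)) for _ in range(3))

principal = principal_word_embeddings(word_emb, d)
aligned_tokens, attn_map = cross_modal_match(time_tokens, principal, w_q, w_k, w_v)
```

Here `attn_map` plays the role of an attention map like those described for Figure 3, with one row per time series token and one column per principal word embedding. The cost of attention scales with $d$ rather than with the full vocabulary size, which is one plausible reason for compressing the vocabulary first.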