TS-MLLM: A Multi-Modal Large Language Model-based Framework for Industrial Time-Series Big Data Analysis

Haiteng Wang, Yikang Li, Yunfei Zhu, Jingheng Yan, Lei Ren, Laurence T. Yang

TL;DR

TS-MLLM is a unified multi-modal large language model framework that jointly models temporal signals, frequency-domain images, and textual domain knowledge. The authors report that it significantly outperforms state-of-the-art methods for industrial time-series prediction.

Abstract

Accurate analysis of industrial time-series big data is critical for the Prognostics and Health Management (PHM) of industrial equipment. While recent advancements in Large Language Models (LLMs) have shown promise in time-series analysis, existing methods typically focus on single-modality adaptations, failing to exploit the complementary nature of temporal signals, frequency-domain visual representations, and textual knowledge information. In this paper, we propose TS-MLLM, a unified multi-modal large language model framework designed to jointly model temporal signals, frequency-domain images, and textual domain knowledge. Specifically, we first develop an Industrial time-series Patch Modeling branch to capture long-range temporal dynamics. To integrate cross-modal priors, we introduce a Spectrum-aware Vision-Language Model Adaptation (SVLMA) mechanism that enables the model to internalize frequency-domain patterns and semantic context. Furthermore, a Temporal-centric Multi-modal Attention Fusion (TMAF) mechanism is designed to actively retrieve relevant visual and textual cues using temporal features as queries, ensuring deep cross-modal alignment. Extensive experiments on multiple industrial benchmarks demonstrate that TS-MLLM significantly outperforms state-of-the-art methods, particularly in few-shot and complex scenarios. The results validate our framework's superior robustness, efficiency, and generalization capabilities for industrial time-series prediction.
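
The abstract describes two ideas that a small example can make concrete: splitting a signal into patches, and using the temporal features as attention queries over visual and textual cues. The following is a minimal sketch in plain Python, not the authors' implementation. The function names (`make_patches`, `cross_attention`), the patch length and stride, and the toy cue vectors are all illustrative assumptions; a real model would also apply learned query, key, and value projections, which are omitted here.

```python
import math

def make_patches(series, patch_len, stride):
    """Split a 1-D series into (possibly overlapping) fixed-length patches."""
    return [series[i:i + patch_len]
            for i in range(0, len(series) - patch_len + 1, stride)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys.

    No learned projections are applied (an assumption made for brevity).
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        fused.append([sum(w * v[c] for w, v in zip(weights, values))
                      for c in range(len(values[0]))])
    return fused

# Toy sensor signal, patched into temporal tokens of dimension 4.
series = [math.sin(0.3 * t) for t in range(16)]
temporal_tokens = make_patches(series, patch_len=4, stride=2)

# Hypothetical stand-ins for visual and textual cue embeddings.
cue_tokens = [[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5, 0.5]]

# Temporal tokens act as queries; cue tokens serve as keys and values.
fused_tokens = cross_attention(temporal_tokens, cue_tokens, cue_tokens)
```

Each fused token is a weighted average of the cue tokens, with weights determined by how well the cue matches the temporal patch. This is the sense in which temporal queries "retrieve" cross-modal information.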

Paper Structure

This paper contains 28 sections, 17 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: The overall architecture of the proposed TS-MLLM. It consists of three primary modules: (1) an Industrial time-series Modeling Branch that captures long-range temporal dependencies; (2) an SVLMA mechanism that aligns frequency-domain representations with textual domain knowledge via an MLLM; and (3) a TMAF mechanism that uses temporal features as queries to actively retrieve and integrate multi-modal cues for the final prediction.
  • Figure 2: Qualitative RUL prediction on C-MAPSS. Results on representative test engines from FD001 to FD004. For each subset, the top plot shows predicted and ground-truth RUL over time steps, and the bottom plot shows the absolute error over time steps.
  • Figure 3: Few-shot performance on C-MAPSS. Results on FD001 to FD004 under different training-data ratios from 5% to 100%, evaluated by Score, MAE, MAPE, and RMSE. Lower is better.
  • Figure 4: UMAP projections of time-series embeddings and MLLM feature embeddings on FD001 to FD004. The two embedding sets are largely separated across subsets.
  • Figure 5: Distributions of fusion-gate weights on FD001 to FD004. Weights for time-series and MLLM features span a broad range and vary across subsets.
  • ...and 3 more figures
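
The Figure 3 caption lists four evaluation metrics (Score, MAE, MAPE, and RMSE) without defining them. The sketch below shows how they are commonly computed for remaining useful life (RUL) prediction. The page does not define "Score", so this assumes the asymmetric scoring function widely used with C-MAPSS, which penalizes late predictions more heavily than early ones. The function name `rul_metrics` and the sample values are illustrative, and MAPE here assumes no true RUL value is zero.

```python
import math

def rul_metrics(y_true, y_pred):
    """Return Score, MAE, MAPE (%), and RMSE for RUL predictions.

    The error is defined as predicted minus true RUL, so a positive
    error means the failure is predicted later than it occurs.
    """
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # Assumes every true RUL value is nonzero.
    mape = 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    # Assumed C-MAPSS-style score: late predictions (e >= 0) cost more.
    score = sum(math.exp(-e / 13.0) - 1.0 if e < 0 else math.exp(e / 10.0) - 1.0
                for e in errors)
    return {"score": score, "mae": mae, "mape": mape, "rmse": rmse}

# One early prediction (90 vs. 100) and one late prediction (60 vs. 50).
metrics = rul_metrics(y_true=[100.0, 50.0], y_pred=[90.0, 60.0])
```

Both sample errors have the same magnitude, yet the late prediction contributes more to the score, which reflects the higher operational cost of overestimating remaining life. All four metrics are lower-is-better, consistent with the caption.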