FISHER: A Foundation Model for Multi-Modal Industrial Signal Comprehensive Representation

Pingyi Fan, Anbai Jiang, Shuwei Zhang, Zhiqiang Lv, Bing Han, Xinhu Zheng, Wenrui Liang, Junjie Li, Wei-Qiang Zhang, Yanmin Qian, Xie Chen, Cheng Lu, Jia Liu

TL;DR

This work addresses the M5 challenge of multi-modal, multi-sampling-rate industrial signals by introducing FISHER, an STFT-based sub-band representation model that combines a teacher-student self-supervised framework with a ViT encoder. It unifies diverse signal modalities and sampling rates by treating a higher sampling rate as the concatenation of additional sub-band information. The approach is validated on RMIS, a benchmark spanning anomaly detection and fault diagnosis, where FISHER achieves gains of up to 4.2% over baselines and scales more efficiently with model size. The tiny variant outperforms many baselines despite having far fewer parameters. Both FISHER and RMIS are open-sourced, offering a practical path toward scalable industrial signal representations that exploit cross-modality synergies for health management in manufacturing.

Abstract

With the rapid deployment of SCADA systems, effectively analyzing industrial signals and detecting abnormal states has become an urgent need for the industry. Due to the significant heterogeneity of these signals, which we summarize as the M5 problem, previous works focus only on small sub-problems and employ specialized models, failing to utilize the synergies between modalities and the powerful scaling law. However, we argue that the M5 signals can be modeled in a unified manner due to their intrinsic similarity. We therefore propose FISHER, a Foundation model for multi-modal Industrial Signal compreHEnsive Representation. To support arbitrary sampling rates, FISHER treats an increase in sampling rate as the concatenation of sub-band information. Specifically, FISHER takes the STFT sub-band as the modeling unit and adopts a teacher-student SSL framework for pre-training. We also develop the RMIS benchmark, which evaluates the representations of M5 industrial signals on multiple health management tasks. Compared with top SSL models, FISHER showcases versatile and outstanding capabilities with a general performance gain of up to 4.2%, along with much more efficient scaling curves. We also investigate the scaling law on downstream tasks and derive potential avenues for future work. Both FISHER and RMIS are now open-sourced.
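
The sub-band idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the 25 ms window, 10 ms hop, 50-bin sub-band width, and the function names `stft_magnitude` and `split_sub_bands` are all assumptions chosen for illustration. The point it shows is that, when the STFT window is fixed in seconds rather than in samples, every sampling rate shares the same frequency resolution, so doubling the sampling rate simply appends more sub-bands on top of the existing ones.

```python
import numpy as np

def stft_magnitude(signal, sample_rate, win_seconds=0.025, hop_seconds=0.010):
    """Magnitude STFT whose window and hop are fixed in seconds (assumed values)."""
    win = int(round(win_seconds * sample_rate))
    hop = int(round(hop_seconds * sample_rate))
    window = np.hanning(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + win] * window for i in range(n_frames)]
    )
    # rfft gives win // 2 + 1 bins spaced 1 / win_seconds Hz apart at any sample rate.
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq_bins, frames)

def split_sub_bands(spec, band_bins=50):
    """Cut the frequency axis into consecutive sub-bands of band_bins bins each.

    Leftover bins at the top of the spectrum are dropped (a simplification).
    """
    n_bands = spec.shape[0] // band_bins
    return [spec[b * band_bins : (b + 1) * band_bins] for b in range(n_bands)]

def one_second_tone(sample_rate, freq_hz=1000.0):
    """Hypothetical test signal: a one-second sine tone."""
    t = np.arange(sample_rate) / sample_rate
    return np.sin(2 * np.pi * freq_hz * t)

bands_16k = split_sub_bands(stft_magnitude(one_second_tone(16000), 16000))
bands_32k = split_sub_bands(stft_magnitude(one_second_tone(32000), 32000))
print(len(bands_16k), len(bands_32k))  # 4 8
```

With these assumed settings each bin spans 40 Hz, so a 50-bin sub-band covers 2 kHz. The 16 kHz signal yields four sub-bands and the 32 kHz signal yields eight, of which the first four cover the same frequency range as the 16 kHz case.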

Paper Structure

This paper contains 17 sections, 4 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Model performance on the RMIS benchmark, which consists of two types of health management tasks and 19 distinct datasets covering four modalities. For each dataset, a higher score indicates a better model. Compared with top baseline models, FISHER achieves superior performance with a much smaller model size, especially on fault diagnosis tasks, demonstrating versatile capabilities and efficient scaling properties.
  • Figure 2: Pipeline of FISHER. FISHER converts signals into STFT spectrograms and splits them into sub-bands with fixed bandwidth $w$. These sub-bands are processed individually by the ViT backbone, and the [CLS] embeddings are concatenated to form the signal representation.
  • Figure 3: STFT spectrograms of the same source under different sampling rates. Here we adopt a fixed-duration window and hop size for the STFT. Higher sampling rates contribute additional sub-bands that carry extra information, so it is intuitive to select the sub-band as the modeling unit.
  • Figure 4: STFT spectrograms of two vibration signals from the WTPG dataset. Both are extremely stationary throughout the entire clip (more than 300 s), causing the split segments to be nearly identical.
  • Figure 5: Performance curves on the RMIS benchmark. The horizontal axis is the model size and the vertical axis is the dataset score; a higher curve indicates a better model. FISHER is the most versatile model on the RMIS benchmark: it is generally the second-best model for anomaly detection (slightly behind BEATs) and the best model for fault diagnosis. Moreover, FISHER achieves better performance with a much smaller model size, making it readily deployable in manufacturing.
  • ...and 1 more figures
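
The pipeline described in the Figure 2 caption (encode each sub-band separately, then concatenate the per-band embeddings) can be sketched as follows. This is only an illustration under stated assumptions: `toy_encoder` is a hypothetical stand-in (mean pooling plus a shared linear projection) for the ViT backbone and its [CLS] embedding, and the sub-band width of 50 bins and embedding size of 8 are arbitrary.

```python
import numpy as np

def toy_encoder(sub_band, weights):
    """Hypothetical stand-in for the ViT backbone: pool over time, then project."""
    pooled = sub_band.mean(axis=1)  # shape: (band_bins,)
    return weights @ pooled         # shape: (embed_dim,)

def represent(spec, band_bins, weights):
    """Encode each fixed-width sub-band with shared weights and concatenate."""
    n_bands = spec.shape[0] // band_bins
    embeddings = [
        toy_encoder(spec[b * band_bins : (b + 1) * band_bins], weights)
        for b in range(n_bands)
    ]
    return np.concatenate(embeddings)  # shape: (n_bands * embed_dim,)

rng = np.random.default_rng(0)
band_bins, embed_dim = 50, 8  # assumed sizes, for illustration only
weights = rng.standard_normal((embed_dim, band_bins))

# Random placeholder spectrograms: the "high-rate" one contains the low-rate
# spectrogram plus four extra sub-bands stacked above it.
spec_low = rng.random((200, 98))
spec_high = np.concatenate([spec_low, rng.random((200, 98))], axis=0)

rep_low = represent(spec_low, band_bins, weights)
rep_high = represent(spec_high, band_bins, weights)
print(rep_low.shape, rep_high.shape)  # (32,) (64,)
```

Because the same encoder is applied to every sub-band, the representation length grows with the number of sub-bands, and in this toy setup the first 32 entries of `rep_high` equal `rep_low`.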