FaultGPT: Industrial Fault Diagnosis Question Answering System by Vision Language Models

Jiao Chen, Ruyi Huang, Zuohong Lv, Jianhua Tang, Weihua Li

TL;DR

FaultGPT generates diagnostic reports directly from vibration signals by framing industrial fault diagnosis as end-to-end fault diagnosis question answering (FDQA) with large vision-language models (LVLMs). It introduces a multimodal FDQA dataset and a Multi-Scale Cross-modal Image Decoder with a Prompt Learner that aligns time-frequency image features with language outputs without modifying LVLM weights. The authors report superior fault-report generation and cross-dataset generalization in few-shot and zero-shot settings across the CWRU, SCUT-FD, and Ottawa datasets. The work advances interpretable, report-style fault diagnosis suitable for industrial deployment and suggests future work on remaining-useful-life prediction and other domains.

Abstract

Recently, employing single-modality large language models based on mechanical vibration signals as Tuning Predictors has introduced new perspectives in intelligent fault diagnosis. However, the potential of these methods to leverage multimodal data remains underexploited, particularly in complex mechanical systems where relying on a single data source often fails to capture comprehensive fault information. In this paper, we present FaultGPT, a novel model that generates fault diagnosis reports directly from raw vibration signals. By leveraging large vision-language models (LVLMs) and text-based supervision, FaultGPT performs end-to-end fault diagnosis question answering (FDQA), which distinguishes it from traditional classification or regression approaches. Specifically, we construct a large-scale FDQA instruction dataset for instruction tuning of LVLMs. This dataset includes vibration time-frequency image and text label pairs, as well as human instruction and ground-truth pairs. To improve the quality of the generated fault diagnosis reports, we design a multi-scale cross-modal image decoder to extract fine-grained fault semantics and conduct instruction tuning without introducing additional training parameters into the LVLM. Extensive experiments, including fault diagnosis report generation and few-shot and zero-shot evaluation across multiple datasets, validate the superior performance and adaptability of FaultGPT in diverse industrial scenarios.
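The abstract refers to vibration time-frequency images without naming the transform used to produce them. As a rough illustration only, the sketch below turns a synthetic vibration-like signal into a log-magnitude time-frequency image with a short-time Fourier transform. The choice of STFT, the function name `stft_image`, the window and hop sizes, the sampling rate, and the synthetic signal are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def stft_image(signal, fs, win_len=256, hop=64):
    """Hypothetical helper: log-magnitude STFT image of a 1-D signal.

    Returns (freqs, times, image), where image has shape (n_freqs, n_frames).
    """
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + win_len] * window for i in range(n_frames)]
    )
    # One-sided FFT magnitude per frame, then transpose to (frequency, time).
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    image = 20.0 * np.log10(magnitude + 1e-12)  # dB scale; epsilon avoids log(0)
    freqs = np.fft.rfftfreq(win_len, d=1.0 / fs)
    times = (np.arange(n_frames) * hop + win_len / 2) / fs
    return freqs, times, image


# Synthetic stand-in for a bearing signal (assumed values): a 3 kHz resonance
# amplitude-modulated at 100 Hz, plus a small amount of noise.
fs = 12_000
t = np.arange(fs) / fs  # one second of samples
rng = np.random.default_rng(0)
signal = (1.0 + 0.5 * np.sin(2 * np.pi * 100 * t)) * np.sin(2 * np.pi * 3000 * t)
signal = signal + 0.05 * rng.standard_normal(t.size)

freqs, times, image = stft_image(signal, fs)
peak_hz = freqs[image.mean(axis=1).argmax()]  # dominant frequency band
```

In a pipeline like the one the abstract describes, an image of this kind would be paired with a text label to form one training example; how the paper actually renders and labels its images is not stated in this section.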

Paper Structure

This paper contains 30 sections, 10 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Inference process of FaultGPT compared to traditional fault diagnosis methods.
  • Figure 2: Comparison of previous fault diagnosis methods with our proposed FaultGPT.
  • Figure 3: The time-frequency images and energy characteristics in different bearings.
  • Figure 4: The overall training framework of the proposed FaultGPT. ① represents the visual encoder, ② indicates the proposed multi-scale cross-modal image decoder, and ③ denotes the proposed prompt learner.
  • Figure 5: Example of fault diagnosis instruction data.
  • ...and 6 more figures