Prompt-Driven Lightweight Foundation Model for Instance Segmentation-Based Fault Detection in Freight Trains

Guodong Sun; Qihang Liang; Xingyu Pan; Moyun Liu; Yang Zhang

Abstract

Accurate visual fault detection in freight trains remains a critical challenge for intelligent transportation system maintenance, due to complex operational environments, structurally repetitive components, and frequent occlusions or contaminations in safety-critical regions. Conventional instance segmentation methods based on convolutional neural networks and Transformers often suffer from poor generalization and limited boundary accuracy under such conditions. To address these challenges, we propose a lightweight self-prompted instance segmentation framework tailored for freight train fault detection. Our method leverages the Segment Anything Model by introducing a self-prompt generation module that automatically produces task-specific prompts, enabling effective knowledge transfer from foundation models to domain-specific inspection tasks. In addition, we adopt a Tiny Vision Transformer backbone to reduce computational cost, making the framework suitable for real-time deployment on edge devices in railway monitoring systems. We construct a domain-specific dataset collected from real-world freight inspection stations and conduct extensive evaluations. Experimental results show that our method achieves 74.6 $AP^{\text{box}}$ and 74.2 $AP^{\text{mask}}$ on the dataset, outperforming existing state-of-the-art methods in both accuracy and robustness while maintaining low computational overhead. This work offers a deployable and efficient vision solution for automated freight train inspection, demonstrating the potential of foundation model adaptation in industrial-scale fault diagnosis scenarios. Project page: https://github.com/MVME-HBUT/SAM_FTI-FDet.git
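
The abstract reports results as $AP^{\text{box}}$ and $AP^{\text{mask}}$. Assuming these follow the common COCO-style convention (this excerpt does not state the evaluation protocol), both metrics rest on the intersection-over-union (IoU) between a prediction and a ground-truth instance, computed over bounding boxes or over pixel masks. A minimal Python sketch of the two IoU variants is given here; the function names `box_iou` and `mask_iou` are illustrative and are not taken from the project's code.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mask_iou(m1, m2):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = 0
    union = 0
    for row1, row2 in zip(m1, m2):
        for p, q in zip(row1, row2):
            inter += 1 if (p and q) else 0
            union += 1 if (p or q) else 0
    return inter / union if union > 0 else 0.0
```

Under the assumed convention, a prediction counts as a true positive when its IoU with a ground-truth instance exceeds a threshold, and AP summarizes precision over recall levels (and, in COCO, over several IoU thresholds). Mask IoU is sensitive to boundary quality in a way box IoU is not, which is why the two numbers are reported separately.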

Paper Structure (31 sections, 6 equations, 9 figures, 12 tables)

Introduction
Related Work
Fault Detection for Freight Train Images
Foundation Model
Prompt Learning
Method
Overall Framework
Prompt Generator
Adaptive Feature Dispatcher
Mask Decoder
Experiments
Experiments Setup
Datasets
Evaluation Metrics
Implementation Details
...and 16 more sections

Figures (9)

Figure 1: Activation maps from various detectors. (a) Ground truth annotations for each image. (b) Activation maps of Mask R-CNN. (c) Activation maps of SAM. (d) Activation maps of our proposed prompter-driven SAM detection method. As the activation maps show, our method captures the regions of interest more accurately owing to its efficient prompt mechanism.
Figure 2: Overview of our proposed SAM-based framework for visual fault detection in freight trains. The network uses a prompt generator to provide high-quality prompts for TinyViT-SAM, thereby enhancing the model's segmentation capability. In addition, spatial and hierarchical embeddings are incorporated to improve the image perception ability of TinyViT-SAM, enabling efficient, low-cost optimization.
Figure 3: Comparison of feature maps with and without prompts at the mask decoder stage, shown for three typical viewpoints. (a) Ground truth annotations. (b) Feature maps generated using a conventional detection head without prompt guidance. (c) Feature maps obtained after embedding prompts in the mask decoder stage. With the proposed architecture, the target regions in (a) can be reliably detected.
Figure 4: Image acquisition in the wild. (a) Side view. (b) Bottom view.
Figure 5: Sample images from the dataset, where “T” indicates Normal and “F” indicates Damaged or Missing. For brake shoes, the primary criterion is thickness. For other parts, the main criterion is whether the part is damaged or missing.
...and 4 more figures
