OmniFuser: Adaptive Multimodal Fusion for Service-Oriented Predictive Maintenance

Ziqi Wang, Hailiang Zhao, Yuhao Yang, Daojiang Hu, Cheng Bao, Mingyi Liu, Kai Di, Schahram Dustdar, Zhongjie Wang, Shuiguang Deng

TL;DR

OmniFuser tackles predictive maintenance for milling tools by fusing high-resolution images with cutting-force signals through a contamination-free cross-modal fusion (C$^2$F) framework augmented by proxy-based cross-modal attention (PCMA) and a recursive refinement mechanism. The model explicitly separates modality-specific and shared information, anchors fusion to original features to stabilize updates, and outputs both tool-state classifications and multi-step force forecasts as reusable service modules. Empirical results on MATWI and Mudestreda show consistent improvements over unimodal baselines and existing multimodal methods, with around 8–10% reductions in MSE/MAE and about 2% gains in classification accuracy. The approach demonstrates practical viability for service-oriented predictive maintenance in real industrial settings and provides a scalable foundation for extending to other assets and downtream maintenance tasks.

Abstract

Accurate and timely prediction of tool conditions is critical for intelligent manufacturing systems, where unplanned tool failures can lead to quality degradation and production downtime. In modern industrial environments, predictive maintenance is increasingly implemented as an intelligent service that integrates sensing, analysis, and decision support across production processes. To meet the demand for reliable and service-oriented operation, we present OmniFuser, a multimodal learning framework for predictive maintenance of milling tools that leverages both visual and sensor data. It performs parallel feature extraction from high-resolution tool images and cutting-force signals, capturing complementary spatiotemporal patterns across modalities. To effectively integrate heterogeneous features, OmniFuser employs a contamination-free cross-modal fusion mechanism that disentangles shared and modality-specific components, allowing for efficient cross-modal interaction. Furthermore, a recursive refinement pathway functions as an anchor mechanism, consistently retaining residual information to stabilize fusion dynamics. The learned representations can be encapsulated as reusable maintenance service modules, supporting both tool-state classification (e.g., Sharp, Used, Dulled) and multi-step force signal forecasting. Experiments on real-world milling datasets demonstrate that OmniFuser consistently outperforms state-of-the-art baselines, providing a dependable foundation for building intelligent industrial maintenance services.
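
The shared/modality-specific split and the anchored refinement described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the fixed `shared_basis`, the `tanh` stand-in for a learned fusion layer, and the mixing weight `alpha` are all assumptions made purely for illustration.

```python
import numpy as np

def split_shared_specific(z, shared_basis):
    """Split features z (n, d) into a specific part P and a shared part S, with P orthogonal to S.

    shared_basis (d, k) spans a hypothetical shared subspace (an assumption);
    S is the orthogonal projection of z onto it and P is the remainder.
    """
    q, _ = np.linalg.qr(shared_basis)   # (d, k) orthonormal columns
    s = z @ q @ q.T                     # shared component
    p = z - s                           # modality-specific component
    return p, s

def anchored_refine(s_r, s_i, steps=3, alpha=0.5):
    """Refine a fused representation recursively, re-injecting the anchor at every step."""
    anchor = 0.5 * (s_r + s_i)          # initial fusion of the original shared features
    fused = anchor
    for _ in range(steps):
        update = np.tanh(fused)         # stand-in for a learned fusion layer (assumption)
        fused = alpha * anchor + (1 - alpha) * update
    return fused

rng = np.random.default_rng(0)
d, k, n = 8, 3, 4
basis = rng.normal(size=(d, k))
z_r = rng.normal(size=(n, d))           # e.g. force-signal features
z_i = rng.normal(size=(n, d))           # e.g. image features

p_r, s_r = split_shared_specific(z_r, basis)
p_i, s_i = split_shared_specific(z_i, basis)
fused = anchored_refine(s_r, s_i)

# Only the shared parts interact; the specific parts are carried through untouched.
representation = np.concatenate([p_r, fused, p_i], axis=1)
```

Only the shared components are mixed across modalities in this sketch, so the modality-specific parts reach the downstream heads unchanged, and each refinement step is pulled back toward the initial fusion rather than drifting away from it.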

Paper Structure

This paper contains 22 sections, 2 theorems, 24 equations, 14 figures, 5 tables.

Key Result

Theorem 1

Given the modality-specific decomposition $\mathbf{Z}^m = \mathbf{P}^m + \mathbf{S}^m$ with $\mathbf{P}^m \perp \mathbf{S}^m$ for modality $m \in \{\textrm{r}, \textrm{i}\}$, the mutual information between the original modalities is lower bounded by that of the shared components:
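
Written out, and assuming $I(\cdot\,;\cdot)$ denotes mutual information, the bound stated in words above would read as follows. This is a reconstruction from the sentence itself, not the paper's typeset equation:

$$I(\mathbf{Z}^{\textrm{r}}; \mathbf{Z}^{\textrm{i}}) \;\geq\; I(\mathbf{S}^{\textrm{r}}; \mathbf{S}^{\textrm{i}})$$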

Figures (14)

  • Figure 1: Time-frequency analysis of cutting force signals.
  • Figure 2: Decomposition into trend and residual components.
  • Figure 3: Autocorrelation analysis on two milling tools.
  • Figure 4: Overall architecture of OmniFuser. Temporal and spatial features are first extracted through dedicated modules and then progressively fused by the $\text{C}^2\text{F}$ module. The resulting fused representation is used for downstream prediction tasks.
  • Figure 5: The architecture of RTD.
  • ...and 9 more figures

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Theorem 2
  • proof