SERVIMON: AI-Driven Predictive Maintenance and Real-Time Monitoring for Astronomical Observatories

Emilio Mastriani, Alessandro Costa, Federico Incardona, Kevin Munari, Sebastiano Spinello

TL;DR

ServiMon tackles the reliability challenges of distributed astronomical observatories by delivering a scalable, cloud-native telemetry and anomaly-detection pipeline. It combines Prometheus-Grafana-InfluxDB-based data collection with a machine-learning core that uses Isolation Forest for real-time anomaly detection and predictive maintenance in Cassandra-backed telemetry streams. The paper details the end-to-end architecture, the training and inference workflows, and validation using synthetic load tests, demonstrating early degradation detection and proactive system management. The approach scales to future large-scale observatories, enabling real-time monitoring, reduced downtime, and adaptable data-driven maintenance strategies.

Abstract

Objective: ServiMon is designed to offer a scalable and intelligent pipeline for data collection and auditing to monitor distributed astronomical systems such as the ASTRI Mini-Array. The system enhances quality control, predictive maintenance, and real-time anomaly detection for telescope operations. Methods: ServiMon integrates cloud-native technologies (including Prometheus, Grafana, Cassandra, Kafka, and InfluxDB) for telemetry collection and processing. It employs machine learning algorithms, notably Isolation Forest, to detect anomalies in Cassandra performance metrics. Key indicators such as read/write latency, throughput, and memory usage are continuously monitored, stored as time-series data, and preprocessed for feature engineering. Anomalies detected by the model are logged in InfluxDB v2 and accessed via Flux for real-time monitoring and visualization. Results: AI-based anomaly detection increases system resilience by identifying performance degradation at an early stage, minimizing downtime, and optimizing telescope operations. Additionally, ServiMon supports astrostatistical analysis by correlating telemetry with observational data, thus enhancing scientific data quality. AI-generated alerts also improve real-time monitoring, enabling proactive system management. Conclusion: ServiMon's scalable framework proves effective for predictive maintenance and real-time monitoring of astronomical infrastructures. By leveraging cloud and edge computing, it is adaptable to future large-scale experiments, optimizing both performance and cost. The combination of machine learning and big data analytics makes ServiMon a robust and flexible solution for modern and next-generation observational astronomy.
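
To make the Methods step concrete, the following is a minimal sketch of how an Isolation Forest can score Cassandra-style metrics. It is not ServiMon's code: the feature names (read latency, write latency, heap usage), the synthetic values, and all function names are assumptions made for illustration, and the forest is a small pure-Python reimplementation so the example has no dependencies. A production pipeline would more likely rely on an existing implementation such as scikit-learn's `IsolationForest`.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Recursively partition rows with random axis-aligned splits."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    return {
        "feature": feature,
        "split": split,
        "left": build_tree([r for r in rows if r[feature] < split], depth + 1, max_depth, rng),
        "right": build_tree([r for r in rows if r[feature] >= split], depth + 1, max_depth, rng),
    }


def path_length(row, node, depth=0):
    """Depth at which a row lands in a leaf, adjusted for the unsplit leaf size."""
    if "size" in node:
        return depth + c_factor(node["size"])
    child = node["left"] if row[node["feature"]] < node["split"] else node["right"]
    return path_length(row, child, depth + 1)


def fit_forest(rows, n_trees=100, sample_size=64, seed=0):
    """Build n_trees isolation trees, each on a random subsample of rows."""
    if len(rows) < 2:
        raise ValueError("need at least two rows to build a forest")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(rows, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, sample_size


def anomaly_score(row, trees, sample_size):
    """Score in (0, 1]; values closer to 1 mean the row is isolated more quickly."""
    mean_path = sum(path_length(row, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / c_factor(sample_size))


# Synthetic "healthy" telemetry (hypothetical units and ranges):
# (read latency in ms, write latency in ms, heap usage as a fraction).
data_rng = random.Random(42)
healthy = [
    (data_rng.gauss(5.0, 0.5), data_rng.gauss(3.0, 0.3), data_rng.gauss(0.5, 0.05))
    for _ in range(200)
]

trees, sample_size = fit_forest(healthy)

normal_score = anomaly_score((5.0, 3.0, 0.5), trees, sample_size)
anomalous_score = anomaly_score((40.0, 25.0, 0.95), trees, sample_size)

print("typical sample score: ", round(normal_score, 3))
print("degraded sample score:", round(anomalous_score, 3))
```

In this sketch a sample with inflated latencies and heap usage is separated from the healthy cloud after fewer random splits, so it receives a higher score than a typical sample. Where to place the alerting threshold on that score is a tuning decision that the abstract does not specify.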

Paper Structure

This paper contains 7 sections and 4 figures.

Figures (4)

  • Figure 1: Three blocks interaction
  • Figure 2: System overview and ML anomaly detection results
  • Figure 3: Anomalies shown in the browser
  • Figure 4: Anomaly detection on log files