
The Simons Observatory: Alarms and Detector Quality Monitoring

David V. Nguyen, Sanah Bhimani, Nicholas Galitzki, Brian J. Koopman, Jack Lashner, Laura Newburgh, Max Silva-Feaver, Kyohei Yamada

TL;DR

The paper addresses the need for real-time alarms and data-quality monitoring in the Simons Observatory (SO) with more than 60,000 detectors across four telescopes. It details an integrated alarm system within the Observatory Control System (OCS) that uses InfluxDB, Grafana, Slack, and the campana service to monitor both housekeeping and detector-quality metrics, including architecture, data sources, and deployment experiences. Commissioning results demonstrate the system can manage hundreds of alert rules, support alert silencing, and promptly flag issues such as a helium leak in the PTC system, enabling rapid response and safeguarding observing time. The work presents a scalable, extensible framework suitable for SO expansions to UK/Japan and the Advanced Simons Observatory (ASO), with practical implications for robust, multi-telescope operations and data-quality assurance.

Abstract

The Simons Observatory (SO) is a group of modern telescopes dedicated to observing the polarized cosmic microwave background (CMB), transients, and more. The Observatory consists of four telescopes and instruments, with over 60,000 superconducting detectors in total, located at ~5,200 m altitude in the Atacama Desert of Chile. During observations, it is important to ensure the detectors, telescope platforms, calibration and receiver hardware, and site hardware are within operational bounds. To facilitate rapid response when problems arise with any part of the system, it is essential that alerts are generated and distributed to appropriate personnel if components exceed these bounds. Similarly, alerts are generated if the quality of the data has become degraded. In this paper, we describe the SO alarm system we developed within the larger Observatory Control System (OCS) framework, including the data sources, alert architecture, and implementation. We also present results from deploying the alarm system during the commissioning of the SO telescopes and receivers.

Paper Structure (12 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: An example 4-set Venn diagram to demonstrate the detector bias cuts. Each set is a parameter that causes the detector to be removed. The bias group (bg) is flagged if <0 since unassigned detectors are labeled with -1. The detector resistance ($r_{tes}$) is flagged if <=0. The fractional resistance ($r_{frac}$) is flagged if outside of the defined range (0.05, 0.9). The saturation power ($p_{sat}$) is flagged if outside of the defined range (0, 20). Overlaps indicate detectors that are cut in common between different parameters. This example detector wafer is from commissioning of one of the SATs, chosen from a period of less desirable observing conditions to illustrate the cut effects. In this case, 564 out of 1495 total detectors were cut, 214 of which were common to all flags. This visualization indicates that the $r_{frac}$ cut is the most aggressive, and all detectors removed through the $r_{tes}$ and bg cuts are in common with other parameters, implying these are less discriminatory.
  • Figure 2: The SO alarm system architecture. ocs Agents monitor hardware and SMuRF HK metrics, sending data feeds to InfluxDB (via the Influx Publisher Agent). The data processing pipeline also produces detector data quality metrics, which are sent to InfluxDB. Using InfluxDB as the data source, Grafana monitors the data feeds and emits an alert when thresholds are surpassed. Users create visualizations and define alert rules in Grafana. When an alarm triggers, alerts are sent to Slack as well as campana, which sends emails, SMS, and phone calls. Users subscribe to campana to receive alerts. Nodes colored blue are data sources, as described in the section on data sources for alarms; those colored red make up the architecture and contact points described in the alarm system overview section.
  • Figure 3: An example Grafana dashboard used to monitor metrics for one of the SATs. In this case, we show the health of the SMuRF system and the cryogenics system. This dashboard has a combination of different display types: time-series plots of temperatures, some with multiple sensors shown in the same plot, as well as 'single-stat' panels, which give the most recent value (for this case, the FPGA currents/temperatures and DR temperatures/pressures). Also shown are the single-stat states for binary flags from systems (in this case, that no leak is present in the system). Here, panels show green for normal operations. When thresholds are exceeded and alarms are triggered, the panels show red. Panels are grouped by the alarm groups described in the section on data sources for alarms.
  • Figure 4: The Grafana alert rules page shows all active alerts grouped by the alarm overview dashboards. The list can be filtered to show currently firing alerts.
  • Figure 5: The campana webpage where ROCs can enter their information, subscribe to alert groups, and toggle active notification methods. The inset shows the drop-down menu of groups to select. In this example, the user would receive text messages, but not emails and phone calls.
  • ...and 1 more figure
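
The cut logic described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the SO pipeline's code: the function names `bias_cut_flags` and `summarize_cuts` and the one-array-per-parameter input layout are hypothetical, values exactly on a range boundary are assumed to be kept, and non-finite values are assumed to be flagged. Only the thresholds (bg < 0, $r_{tes}$ <= 0, $r_{frac}$ outside (0.05, 0.9), $p_{sat}$ outside (0, 20)) come from the caption.

```python
import numpy as np


def bias_cut_flags(bg, r_tes, r_frac, p_sat,
                   r_frac_range=(0.05, 0.9), p_sat_range=(0.0, 20.0)):
    """Return one boolean array per parameter; True means 'flagged for removal'.

    Hypothetical helper: thresholds follow the Figure 1 caption, everything
    else (names, input layout, boundary handling) is an assumption.
    """
    bg = np.asarray(bg)
    r_tes = np.asarray(r_tes, dtype=float)
    r_frac = np.asarray(r_frac, dtype=float)
    p_sat = np.asarray(p_sat, dtype=float)

    def outside(values, bounds):
        lo, hi = bounds
        # Negating the "inside" test also flags NaN, since NaN comparisons
        # are False. Boundary values are kept (assumption).
        return ~((values >= lo) & (values <= hi))

    return {
        "bg": bg < 0,            # unassigned detectors are labeled -1
        "r_tes": r_tes <= 0,     # non-positive detector resistance
        "r_frac": outside(r_frac, r_frac_range),
        "p_sat": outside(p_sat, p_sat_range),
    }


def summarize_cuts(flags):
    """Count detectors per flag, cut by any flag, and cut by every flag."""
    stacked = np.stack(list(flags.values()))
    return {
        "per_flag": {name: int(mask.sum()) for name, mask in flags.items()},
        "n_cut": int(stacked.any(axis=0).sum()),         # union of the sets
        "n_common_all": int(stacked.all(axis=0).sum()),  # central overlap
    }


if __name__ == "__main__":
    # Five made-up detectors, purely for illustration.
    flags = bias_cut_flags(
        bg=[0, -1, 1, 2, 0],
        r_tes=[0.004, 0.0, 0.005, 0.006, 0.003],
        r_frac=[0.5, 0.0, 0.95, 0.4, 0.02],
        p_sat=[5.0, -1.0, 8.0, 25.0, 3.0],
    )
    print(summarize_cuts(flags))
```

In Venn-diagram terms, `n_cut` is the size of the union of the four sets and `n_common_all` is the central region where all four overlap; the caption's 564 and 214 correspond to those two quantities for the example wafer.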