Monitizer: Automating Design and Evaluation of Neural Network Monitors

Muqsit Azeem, Marta Grobelna, Sudeep Kanav, Jan Kretinsky, Stefanie Mohr, Sabine Rieder

TL;DR

This paper tackles the safety challenge of neural networks encountering out-of-distribution data by advocating runtime monitoring as a scalable alternative to costly verification. It introduces Monitizer, a modular framework that automates the construction, optimization, and evaluation of NN OOD monitors, supporting 19 monitors across 9 datasets and 15 neural networks, with three optimization methods (random, grid-search, gradient-descent) and multi-objective capabilities. The framework enables end-to-end, objective-driven tuning and transparent per-OOD-class evaluation, addressing the reproducibility and comparability gaps in prior work. Empirical results from an MNIST-based case study demonstrate that monitor performance is highly class-dependent and sensitive to optimization choices, underscoring the value of Monitizer for principled, repeatable benchmarking and deployment-ready monitor selection.

Abstract

The behavior of neural networks (NNs) on previously unseen types of data (out-of-distribution or OOD) is typically unpredictable. This can be dangerous if the network's output is used for decision-making in a safety-critical system. Hence, detecting that an input is OOD is crucial for the safe application of the NN. Verification approaches do not scale to practical NNs, making runtime monitoring more appealing for practical use. While various monitors have been suggested recently, their optimization for a given problem, as well as comparison with each other and reproduction of results, remain challenging. We present a tool for users and developers of NN monitors. It allows for (i) application of various types of monitors from the literature to a given input NN, (ii) optimization of the monitor's hyperparameters, and (iii) experimental evaluation and comparison to other approaches. Besides, it facilitates the development of new monitoring approaches. We demonstrate the tool's usability on several use cases of different types of users as well as on a case study comparing different approaches from recent literature.
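
To make point (ii) concrete, the following is a minimal sketch of random-search hyperparameter optimization for a single monitor. It is not Monitizer's API: the function names `detection_rates` and `random_search`, the single-threshold monitor, and the example scores are all assumptions made for illustration. The monitor is assumed to raise an alarm whenever a confidence score falls below a threshold, and the search keeps the threshold that detects the most OOD inputs while still accepting a required fraction of ID inputs.

```python
import random


def detection_rates(threshold, id_scores, ood_scores):
    """Return (ID accuracy, OOD detection rate) for a threshold monitor.

    Hypothetical monitor: an input is flagged as OOD when its score is
    strictly below `threshold`. ID accuracy is the fraction of ID inputs
    that are NOT flagged; the OOD detection rate is the fraction of OOD
    inputs that ARE flagged.
    """
    id_acc = sum(s >= threshold for s in id_scores) / len(id_scores)
    ood_rate = sum(s < threshold for s in ood_scores) / len(ood_scores)
    return id_acc, ood_rate


def random_search(id_scores, ood_scores, min_id_acc=0.7, trials=200, seed=0):
    """Randomly sample thresholds and keep the best feasible one.

    A threshold is feasible if it keeps at least `min_id_acc` accuracy on
    ID data. Among feasible thresholds, the one with the highest OOD
    detection rate wins. Returns (threshold, ood_rate, id_acc), or None
    if no sampled threshold was feasible.
    """
    rng = random.Random(seed)
    lo = min(min(id_scores), min(ood_scores))
    hi = max(max(id_scores), max(ood_scores))
    best = None
    for _ in range(trials):
        t = rng.uniform(lo, hi)
        id_acc, ood_rate = detection_rates(t, id_scores, ood_scores)
        if id_acc < min_id_acc:
            continue  # violates the ID-accuracy constraint
        if best is None or ood_rate > best[1]:
            best = (t, ood_rate, id_acc)
    return best


if __name__ == "__main__":
    # Made-up confidence scores, for illustration only.
    id_scores = [0.9, 0.8, 0.95, 0.85, 0.7, 0.92, 0.88, 0.75, 0.97, 0.83]
    ood_scores = [0.3, 0.5, 0.6, 0.72, 0.2, 0.4, 0.78, 0.1, 0.55, 0.65]
    print(random_search(id_scores, ood_scores))
```

Grid search would replace the random sampling with a fixed sequence of candidate thresholds; monitors with several hyperparameters would sample a tuple instead of a single value.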

Paper Structure (33 sections, 1 equation, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Illustration of challenges for OOD detection
  • Figure 2: Architecture of Monitizer: The required inputs are an NN and the dataset (both can be chosen from existing options). The dashed area indicates optional inputs, and the bold-faced option indicates the default value. The icons indicate which types of users are expected to use each of the options.
  • Figure 3: Class diagram depicting the different types of OOD data.
  • Figure 4: Examples for OOD
  • Figure 5: The monitor templates were optimized on MNIST as ID and for detecting New-World / CIFAR-10 as OOD while keeping 70% accuracy on ID. All monitors were optimized using random search.
  • ...and 5 more figures