Table of Contents
Fetching ...

Floating-Point Neural Network Verification at the Software Level

Edoardo Manino, Bruno Farias, Rafael Sá Menezes, Fedor Shmarov, Lucas C. Cordeiro

TL;DR

NeuroCodeBench 2.0 establishes a substantial, ground-truth benchmark for verifying floating-point neural network implementations at the software level, stressing SV-COMP-compatible C artifacts across nine categories from individual functions to full networks. The authors conduct a large-scale evaluation of eight state-of-the-art verifiers, revealing limited but nonzero success (average around $11\%$ correct verdicts) and nontrivial rates of incorrect results (around $3\%$), while demonstrating a clear positive impact on tooling since the prior benchmark. By releasing a diverse suite with explicit pre/post-conditions, plus models of math.h, the work highlights both the feasibility and the substantial gaps remaining for scalable, trustworthy verification of neural networks deployed in safety-critical contexts. The findings underscore the need for native math libraries, dedicated neural-network reasoning procedures, and ongoing benchmark evolution to drive meaningful progress in certified FP NN verification.

Abstract

The behaviour of neural network components must be proven correct before deployment in safety-critical systems. Unfortunately, existing neural network verification techniques cannot certify the absence of faults at the software level. In this paper, we show how to specify and verify that neural networks are safe, by explicitly reasoning about their floating-point implementation. In doing so, we construct NeuroCodeBench 2.0, a benchmark comprising 912 neural network verification examples that cover activation functions, common layers, and full neural networks of up to 170K parameters. Our verification suite is written in plain C and is compatible with the format of the International Competition on Software Verification (SV-COMP). Thanks to it, we can conduct the first rigorous evaluation of eight state-of-the-art software verifiers on neural network code. The results show that existing automated verification tools can correctly solve an average of 11% of our benchmark, while producing around 3% incorrect verdicts. At the same time, a historical analysis reveals that the release of our benchmark has already had a significantly positive impact on the latter.

Floating-Point Neural Network Verification at the Software Level

TL;DR

NeuroCodeBench 2.0 establishes a substantial, ground-truth benchmark for verifying floating-point neural network implementations at the software level, stressing SV-COMP-compatible C artifacts across nine categories from individual functions to full networks. The authors conduct a large-scale evaluation of eight state-of-the-art verifiers, revealing limited but nonzero success (average around correct verdicts) and nontrivial rates of incorrect results (around ), while demonstrating a clear positive impact on tooling since the prior benchmark. By releasing a diverse suite with explicit pre/post-conditions, plus models of math.h, the work highlights both the feasibility and the substantial gaps remaining for scalable, trustworthy verification of neural networks deployed in safety-critical contexts. The findings underscore the need for native math libraries, dedicated neural-network reasoning procedures, and ongoing benchmark evolution to drive meaningful progress in certified FP NN verification.

Abstract

The behaviour of neural network components must be proven correct before deployment in safety-critical systems. Unfortunately, existing neural network verification techniques cannot certify the absence of faults at the software level. In this paper, we show how to specify and verify that neural networks are safe, by explicitly reasoning about their floating-point implementation. In doing so, we construct NeuroCodeBench 2.0, a benchmark comprising 912 neural network verification examples that cover activation functions, common layers, and full neural networks of up to 170K parameters. Our verification suite is written in plain C and is compatible with the format of the International Competition on Software Verification (SV-COMP). Thanks to it, we can conduct the first rigorous evaluation of eight state-of-the-art software verifiers on neural network code. The results show that existing automated verification tools can correctly solve an average of 11% of our benchmark, while producing around 3% incorrect verdicts. At the same time, a historical analysis reveals that the release of our benchmark has already had a significantly positive impact on the latter.

Paper Structure

This paper contains 47 sections, 9 equations, 19 figures, 11 tables.

Figures (19)

  • Figure 1: The SoftSign activation is a non-decreasing function in infinite precision (left) but not when implemented in 32-bit IEEE 754 floating-point arithmetic (right).
  • Figure 2: Example implementation of SoftSign function from Figure \ref{['fig:numerical_issues']} together with the monotonicity property that should hold for all values of x1 and x2 within the numerical limits of the float type. The code is adapted from the file "softsign_2_unsafe.c" of NeuroCodeBench 2.0.
  • Figure 3: Visual comparison of popular activation functions.
  • Figure 4: An (unrolled) Hopfield network with SoftSign activations and $t=4$ recurrent steps. The network is able to recover the correct (positive) sign for all output entries, when a strict majority of the input entries are correct. We omit the final classification layer (dashed) from our benchmarks.
  • Figure 5: Details of the SAT ReLU network architecture (a sketch).
  • ...and 14 more figures