Floating-Point Neural Network Verification at the Software Level
Edoardo Manino, Bruno Farias, Rafael Sá Menezes, Fedor Shmarov, Lucas C. Cordeiro
TL;DR
NeuroCodeBench 2.0 establishes a substantial, ground-truth benchmark for verifying floating-point neural network implementations at the software level, stressing SV-COMP-compatible C artifacts across nine categories from individual functions to full networks. The authors conduct a large-scale evaluation of eight state-of-the-art verifiers, revealing limited but nonzero success (average around $11\%$ correct verdicts) and nontrivial rates of incorrect results (around $3\%$), while demonstrating a clear positive impact on tooling since the prior benchmark. By releasing a diverse suite with explicit pre/post-conditions, plus models of math.h, the work highlights both the feasibility and the substantial gaps remaining for scalable, trustworthy verification of neural networks deployed in safety-critical contexts. The findings underscore the need for native math libraries, dedicated neural-network reasoning procedures, and ongoing benchmark evolution to drive meaningful progress in certified FP NN verification.
Abstract
The behaviour of neural network components must be proven correct before deployment in safety-critical systems. Unfortunately, existing neural network verification techniques cannot certify the absence of faults at the software level. In this paper, we show how to specify and verify that neural networks are safe, by explicitly reasoning about their floating-point implementation. In doing so, we construct NeuroCodeBench 2.0, a benchmark comprising 912 neural network verification examples that cover activation functions, common layers, and full neural networks of up to 170K parameters. Our verification suite is written in plain C and is compatible with the format of the International Competition on Software Verification (SV-COMP). Thanks to it, we can conduct the first rigorous evaluation of eight state-of-the-art software verifiers on neural network code. The results show that existing automated verification tools can correctly solve an average of 11% of our benchmark, while producing around 3% incorrect verdicts. At the same time, a historical analysis reveals that the release of our benchmark has already had a significantly positive impact on the latter.
