DeepKnowledge: Generalisation-Driven Deep Learning Testing
Sondess Missaoui, Simos Gerasimou, Nikolaos Matragkas
TL;DR
DeepKnowledge tackles the fragility of DNNs under distribution shift by grounding testing in knowledge generalisation. It identifies Transfer Knowledge (TK) neurons through activation-distribution analysis under ZeroShot-based domain shifts, then quantifies test adequacy with a Transfer Knowledge Coverage (TKC) criterion that measures how well a test set activates diverse TK neuron clusters. Empirical results show that TK-based data augmentation can modestly improve generalisation, that TKC is sensitive to adversarial inputs and so helps expose misbehaviours, and that TKC correlates with existing coverage criteria. The approach offers a knowledge-driven framework for more dependable DNN testing and points to practical extensions such as object detection and automated augmentation.
Abstract
Despite their unprecedented success, DNNs are notoriously fragile to small shifts in data distribution, demanding effective testing techniques that can assess their dependability. Although DNN testing has advanced recently, systematic testing approaches that assess a DNN's capability to generalise and operate comparably on data beyond its training distribution are still lacking. We address this gap with DeepKnowledge, a systematic testing methodology for DNN-based systems founded on the theory of knowledge generalisation, which aims to enhance DNN robustness and reduce the residual risk of 'black box' models. Conforming to this theory, DeepKnowledge posits that core computational DNN units, termed Transfer Knowledge neurons, can generalise under domain shift. DeepKnowledge provides an objective confidence measurement for the testing activities of a DNN under data distribution shifts and uses this information to instrument a generalisation-informed test adequacy criterion that checks the transfer knowledge capacity of a test set. Our empirical evaluation of several DNNs, across multiple datasets and state-of-the-art adversarial generation techniques, demonstrates the usefulness and effectiveness of DeepKnowledge and its ability to support the engineering of more dependable DNNs. We report improvements of up to 10 percentage points over state-of-the-art coverage criteria for detecting adversarial attacks on several benchmarks, including MNIST, SVHN, and CIFAR.
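To make the idea of a coverage-style adequacy criterion over Transfer Knowledge neurons more concrete, the following is a minimal illustrative sketch, not the DeepKnowledge implementation. It assumes that a set of TK neuron indices has already been selected, that each neuron's activation range is known from training data, and that "clusters" can be approximated by equal-width buckets over that range. The function names, the bucketing scheme, and the parameter `n_buckets` are all hypothetical choices made for illustration.

```python
from typing import Dict, Sequence, Set, Tuple


def bucket_index(value: float, low: float, high: float, n_buckets: int) -> int:
    """Map an activation value to an equal-width bucket in [low, high].

    Values outside the range are clamped to the first or last bucket.
    (Equal-width buckets are an assumption standing in for clusters.)
    """
    if high <= low:
        return 0
    frac = (value - low) / (high - low)
    idx = int(frac * n_buckets)
    return min(max(idx, 0), n_buckets - 1)


def coverage_sketch(
    activations: Sequence[Sequence[float]],
    tk_neurons: Sequence[int],
    ranges: Dict[int, Tuple[float, float]],
    n_buckets: int = 4,
) -> float:
    """Fraction of (TK neuron, bucket) pairs hit by at least one test input.

    activations: one activation vector per test input.
    tk_neurons:  indices of neurons assumed to be TK neurons (hypothetical).
    ranges:      per-neuron (low, high) activation bounds, assumed known.
    """
    if not tk_neurons or n_buckets <= 0:
        return 0.0
    covered: Set[Tuple[int, int]] = set()
    for vector in activations:
        for neuron in tk_neurons:
            low, high = ranges[neuron]
            covered.add((neuron, bucket_index(vector[neuron], low, high, n_buckets)))
    return len(covered) / (len(tk_neurons) * n_buckets)


# Two test inputs, three neurons, of which neurons 0 and 1 are treated as TK.
example_activations = [[0.1, 0.9, 0.5], [0.6, 0.2, 0.5]]
example_ranges = {0: (0.0, 1.0), 1: (0.0, 1.0)}
example_score = coverage_sketch(example_activations, [0, 1], example_ranges)
```

In this toy setting a score near 1 would mean the test set exercises most activation buckets of the selected neurons, while a low score suggests the test set probes only a narrow part of their behaviour.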
