Improving Interpretability and Accuracy in Neuro-Symbolic Rule Extraction Using Class-Specific Sparse Filters
Parth Padalkar, Jaeseong Lee, Shiyi Wei, Gopal Gupta
TL;DR
The paper tackles the interpretability-accuracy trade-off in neuro-symbolic (NeSy) rule extraction for image classification by identifying post-training binarization as a key source of information loss. It introduces a novel class-specific sparsity loss that guides CNNs to produce sparse, near-binary filter activations during training, which enables higher-fidelity rule extraction via the NeSyFOLD/FOLD-SE-M pipeline. Across multiple training strategies and datasets, the approach yields an average accuracy improvement of 9% and a 53% reduction in rule-set size, bringing NeSy performance within about 3 to 4% of the original CNN while maintaining interpretability. The work demonstrates that carefully designed sparsity objectives can make interpretable neuro-symbolic models competitive with black-box CNNs, and it offers practical guidance along with possible extensions to other architectures such as Vision Transformers.
Abstract
There has been significant focus on creating neuro-symbolic models for interpretable image classification using Convolutional Neural Networks (CNNs). These methods aim to replace the CNN with a neuro-symbolic model consisting of the CNN, which is used as a feature extractor, and an interpretable rule-set extracted from the CNN itself. While these approaches provide interpretability through the extracted rule-set, they often compromise accuracy compared to the original CNN model. In this paper, we identify the root cause of this accuracy loss as the post-training binarization of filter activations that is performed to extract the rule-set. To address this, we propose a novel sparsity loss function that enables class-specific filter binarization during CNN training, thus minimizing information loss when extracting the rule-set. We evaluate several training strategies with our novel sparsity loss, analyzing their effectiveness and providing guidance on their appropriate use. Notably, we set a new benchmark, achieving a 9% improvement in accuracy and a 53% reduction in rule-set size on average compared to the previous state of the art (SOTA), while coming within 3% of the original CNN's accuracy. This highlights the significant potential of interpretable neuro-symbolic models as viable alternatives to black-box CNNs.
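To make the idea of information loss from post-training binarization concrete, the following is a minimal sketch and not the paper's implementation. It assumes filter activations already scaled to [0, 1] and a per-filter threshold of the form alpha * mean + gamma * std (an assumed form; alpha and gamma are hypothetical hyperparameters). The functions `binarize`, `binarization_gap`, and `class_sparsity_penalty`, the `class_filters` assignment, and all toy numbers are hypothetical illustrations. In particular, `class_sparsity_penalty` is a simple stand-in that pushes a class's assigned filters toward 1 and all others toward 0. It is not the paper's loss function.

```python
import statistics

# Hypothetical sketch: names and toy numbers are illustrative only,
# not the paper's implementation or data.

def binarize(acts, alpha=1.0, gamma=0.0):
    """Post-training binarization of filter activations.

    acts[i][k] is the (assumed [0, 1]-scaled) activation of filter k on
    image i. Filter k is set to 1 on image i when its activation exceeds the
    per-filter threshold alpha * mean_k + gamma * std_k (an assumed form).
    """
    n_filters = len(acts[0])
    thresholds = []
    for k in range(n_filters):
        column = [row[k] for row in acts]
        thresholds.append(alpha * statistics.mean(column)
                          + gamma * statistics.pstdev(column))
    binary = [[1 if row[k] > thresholds[k] else 0 for k in range(n_filters)]
              for row in acts]
    return binary, thresholds


def binarization_gap(acts, binary):
    """Mean absolute difference between continuous and binarized activations,
    used here as a rough proxy for the information lost by binarization."""
    diffs = [abs(a - b) for row_a, row_b in zip(acts, binary)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def class_sparsity_penalty(acts, labels, class_filters):
    """Hypothetical class-specific sparsity penalty: for an image of class c,
    push the filters assigned to c toward 1 and every other filter toward 0
    (mean squared error against that binary target)."""
    total, count = 0.0, 0
    for row, c in zip(acts, labels):
        for k, a in enumerate(row):
            target = 1.0 if k in class_filters[c] else 0.0
            total += (a - target) ** 2
            count += 1
    return total / count


# Toy data: 4 images, 2 filters; filter 0 is assigned to class 0, filter 1 to class 1.
labels = [0, 0, 1, 1]
class_filters = {0: {0}, 1: {1}}
dense_acts = [[0.6, 0.4], [0.55, 0.5], [0.45, 0.6], [0.4, 0.5]]        # diffuse
sparse_acts = [[0.95, 0.02], [0.9, 0.05], [0.03, 0.97], [0.05, 0.92]]  # near-binary

for name, acts in [("dense", dense_acts), ("sparse", sparse_acts)]:
    binary, _ = binarize(acts)
    print(name,
          "gap=%.3f" % binarization_gap(acts, binary),
          "penalty=%.3f" % class_sparsity_penalty(acts, labels, class_filters))
```

On this toy data, the near-binary activations receive both a lower penalty and a smaller binarization gap than the diffuse ones. This illustrates the intuition that training toward class-specific, near-binary filters leaves less for post-training binarization to discard.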
