CON-FOLD -- Explainable Machine Learning with Confidence

Lachlan McGinness, Peter Baumgartner

TL;DR

CON-FOLD extends the FOLD-RM framework to provide probability-based confidence scores for learned rules using the Wilson score interval, enabling reliable interpretation and pruning of rule sets. It introduces two pruning strategies (Improvement Threshold and Confidence Threshold) to reduce overfitting and complexity, and supports incorporating background or initial domain knowledge into the logic-programming model. The paper formalizes the learning framework, defines Inverse Brier Score (IBS) as a proper probabilistic performance metric, and demonstrates improvements on UCI benchmarks and a physics marking task that benefits from domain knowledge. This work advances explainable AI by enhancing trust, compactness, and applicability of rule-based classifiers, particularly in data-scarce scenarios. The results indicate CON-FOLD can outperform baselines like XGBoost in certain settings and offers a practical pathway for interpretable, knowledge-augmented ML in real-world tasks.

Abstract

FOLD-RM is an explainable machine learning classification algorithm that uses training data to create a set of classification rules. In this paper we introduce CON-FOLD, which extends FOLD-RM in several ways. CON-FOLD assigns probability-based confidence scores to rules learned for a classification task. This allows users to know how confident they should be in a prediction made by the model. We present a confidence-based pruning algorithm that uses the unique structure of FOLD-RM rules to efficiently prune rules and prevent overfitting. Furthermore, CON-FOLD enables the user to provide pre-existing knowledge in the form of logic program rules that are either (fixed) background knowledge or (modifiable) initial rule candidates. The paper describes our method in detail and reports on practical experiments. We demonstrate the performance of the algorithm on benchmark datasets from the UCI Machine Learning Repository. For that, we introduce a new metric, Inverse Brier Score, to evaluate the accuracy of the produced confidence scores. Finally, we apply this extension to a real-world example that requires explainability: marking of student responses to a short-answer question from the Australian Physics Olympiad.
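
To make the idea of scoring confidence values concrete, the following is a minimal sketch of an Inverse Brier Score. The exact definition is given in the body of the paper and is not reproduced in this summary, so the sketch assumes IBS = 1 - (mean squared difference between each prediction's confidence and an outcome of 1 if the prediction was correct, 0 otherwise). The function name `inverse_brier_score` and its arguments are hypothetical.

```python
from typing import Sequence


def inverse_brier_score(confidences: Sequence[float],
                        correct: Sequence[bool]) -> float:
    """Return 1 minus the Brier score of a list of predictions.

    ASSUMPTION: each prediction carries a single confidence in [0, 1],
    and the outcome is 1.0 when the predicted label was correct and
    0.0 otherwise. This is one plausible reading of the metric, not
    necessarily the paper's exact formula.
    """
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [
        (conf - (1.0 if ok else 0.0)) ** 2
        for conf, ok in zip(confidences, correct)
    ]
    brier = sum(squared_errors) / len(squared_errors)
    return 1.0 - brier


if __name__ == "__main__":
    # Confident and right scores high; confident and wrong scores low.
    print(inverse_brier_score([0.9, 0.9, 0.6], [True, True, False]))
```

Under this assumption, a model that is always fully confident and always right scores 1, while a model that is fully confident and always wrong scores 0, so higher values are better.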

Paper Structure (12 sections, 5 theorems, 11 equations, 4 figures, 1 table, 2 algorithms)

This paper contains 12 sections, 5 theorems, 11 equations, 4 figures, 1 table, 2 algorithms.

Key Result

theorem 1

In the limit where there is a large amount of data classified by a rule ($n \rightarrow \infty$), the confidence score approaches the true probability of the sample being from the target class.
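
The limiting behaviour in Theorem 1 can be illustrated numerically with the Wilson score interval mentioned in the TL;DR. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function name `wilson_interval` is hypothetical, `z = 1.96` (roughly a 95% interval) is an assumed default, and this summary does not say which point of the interval CON-FOLD reports as the confidence score, so the lower bound, centre and upper bound are all returned.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int,
                    z: float = 1.96) -> Tuple[float, float, float]:
    """Return (lower, centre, upper) of the Wilson score interval.

    successes: examples covered by a rule that belong to the target class
    n:         all examples covered by the rule
    z:         normal quantile (ASSUMED default of 1.96)
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)
    )
    return centre - half_width, centre, centre + half_width


if __name__ == "__main__":
    # Hold the observed proportion at 0.8 and let the sample size grow:
    # all three values move towards 0.8 as n increases.
    for n in (10, 100, 10_000, 1_000_000):
        lower, centre, upper = wilson_interval(int(0.8 * n), n)
        print(f"n={n:>9}: lower={lower:.4f} centre={centre:.4f} upper={upper:.4f}")
```

With few covered examples the interval is wide and its centre is pulled towards 0.5, which is the behaviour that makes such a score cautious about rules supported by little data.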

Figures (4)

  • Figure 1: This toy example illustrates the difference between the FOLD-RM and CON-FOLD core algorithms. Both produce rules of the form shown. When generating rule 2, CON-FOLD would not consider the Flamingo as part of the data to fit, whereas FOLD-RM would. Note that in many cases both algorithms would generate an abnormal rule ab(X) :- flamingo(X), preventing the Flamingo from being covered by the first rule. In this case both FOLD-RM and CON-FOLD would include the Flamingo. When harsh pruning occurs and there are few abnormal rules, this subtle change becomes noticeable.
  • Figure 2: Scatter plot of the accuracy and number of rules for a ruleset generated by the pruning algorithm with different values of the improvement threshold and the confidence threshold. Each point has two circles. The background circle displays the number of rules and accuracy for no pruning; therefore all background circles are the same. The front circle displays the number of rules and accuracy when pruning is applied. The accuracy is indicated by the colour shown in the scale-bar on the right-hand side. Pruning conditions that are more accurate than the unpruned condition are indicated with a black dot in the centre. The number of rules is indicated by the area of the circle (equal amount of ink for number of rules), normalised by the number of rules in the unpruned case. The results shown for both the accuracy and the number of rules are the average of 300 trial runs for each test condition.
  • Figure 3: Plot of IBS against percentage of data included in the stratified training data for the E.coli UCI dataset. Thirty trials for each condition were performed and error bars indicate one standard deviation across the trials. Pruned CON-FOLD used a confidence threshold of $0.65$ and a pruning threshold of $0.07$.
  • Figure 4: Each plot shows model performance, measured by the Inverse Brier Score, with different amounts of training data. Plots a and b show the regimes where large amounts of training data are available, while plots e and f explore model performance when very little training data is available. Plots a, c and e use automatic feature extraction, while plots b, d and f use manual feature extraction with regular expressions, which allows domain knowledge in the form of a marking scheme to be included. The total number of student responses was $n=1525$.

Theorems & Definitions (10)

  • theorem 1
  • proof
  • theorem 2
  • proof
  • theorem 3
  • proof
  • theorem 4
  • proof
  • theorem 5
  • proof