Learning Interpretable Rules for Scalable Data Representation and Classification

Zhuo Wang; Wei Zhang; Ning Liu; Jianyong Wang

TL;DR

This work tackles the conflict between interpretability and scalability in rule-based classification by introducing the Rule-based Representation Learner (RRL), a hierarchical model that learns interpretable, non-fuzzy rules for data representation and classification and discretizes continuous features end-to-end. Training uses Gradient Grafting, which directly optimizes the discrete model with gradient descent by pairing it with a differentiable continuous counterpart, aided by novel Logical Activation Functions that mitigate vanishing gradients and improve scalability. Experiments on 14 data sets (ten small, four large) show that RRL often outperforms existing interpretable methods and approaches the performance of strong ensemble and neural models, while its decisions remain readable as explicit rules. Model complexity can be adjusted to trade off against classification accuracy, which suits domains that require explainable decisions.

Abstract

Rule-based models, e.g., decision trees, are widely used in scenarios demanding high model interpretability for their transparent inner structures and good model expressivity. However, rule-based models are hard to optimize, especially on large data sets, due to their discrete parameters and structures. Ensemble methods and fuzzy/soft rules are commonly used to improve performance, but they sacrifice the model interpretability. To obtain both good scalability and interpretability, we propose a new classifier, named Rule-based Representation Learner (RRL), that automatically learns interpretable non-fuzzy rules for data representation and classification. To train the non-differentiable RRL effectively, we project it to a continuous space and propose a novel training method, called Gradient Grafting, that can directly optimize the discrete model using gradient descent. A novel design of logical activation functions is also devised to increase the scalability of RRL and enable it to discretize the continuous features end-to-end. Exhaustive experiments on ten small and four large data sets show that RRL outperforms the competitive interpretable approaches and can be easily adjusted to obtain a trade-off between classification accuracy and model complexity for different scenarios. Our code is available at: https://github.com/12wang3/rrl.
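The idea behind Gradient Grafting, as described in the abstract and the Figure 2 caption, can be illustrated with a minimal sketch: the loss is evaluated on the output of the discrete model, and its gradient with respect to that output is copied ("grafted") onto the output of the continuous model, so that backpropagation can reach the parameters. The sketch below is not the authors' implementation (see the linked repository for that). The one-layer linear model, the squared-error loss, the 0.5 binarization threshold, the clipping of weights to [0, 1], and the names `binarize` and `grafted_step` are all assumptions chosen for illustration.

```python
import numpy as np


def binarize(w, threshold=0.5):
    """Hypothetical quantizer q: map continuous weights to {0, 1}."""
    return (w > threshold).astype(w.dtype)


def grafted_step(w, x, y, lr=0.1):
    """One grafted update for a toy one-layer linear model (illustrative only).

    The forward pass runs twice: once with the continuous weights and once
    with the binarized weights. The loss is measured on the discrete output.
    """
    y_hat = x @ w            # continuous output; here d(y_hat)/d(w) = x
    y_bar = x @ binarize(w)  # discrete output; not differentiable in w
    # Gradient of the mean squared error with respect to the discrete output.
    grad_at_discrete = 2.0 * (y_bar - y) / len(y)
    # Graft: treat that gradient as if it were the gradient at y_hat and
    # backpropagate it through the continuous model. For a nonlinear model
    # this product would use the Jacobian of y_hat with respect to w.
    grad_w = x.T @ grad_at_discrete
    w_new = np.clip(w - lr * grad_w, 0.0, 1.0)  # assumed weight range [0, 1]
    discrete_loss = float(np.mean((y_bar - y) ** 2))
    return w_new, discrete_loss


# Toy usage: the target is produced by binary weights [1, 0, 1].
x = np.eye(3)
y = x @ np.array([1.0, 0.0, 1.0])
w = np.array([0.4, 0.6, 0.4])
for step in range(3):
    w, loss = grafted_step(w, x, y)
    print(step, binarize(w), loss)
```

In this toy setting the continuous weights only serve as a carrier for the updates; the quantity being minimized is the loss of the binarized model, which is the property the abstract attributes to Gradient Grafting.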

Paper Structure (29 sections, 21 equations, 12 figures, 5 tables)

Introduction
Related Work
Rule-based Representation Learner
Logical Layer
Discrete Version
Continuous Version
Novel Logical Activation Functions
Binarization Layer
Gradient Grafting
Single Gradient Grafting
Hierarchical Gradient Grafting
Model Interpretation
Experiments
Dataset Description and Experimental Settings
Classification Performance
...and 14 more sections

Figures (12)

Figure 1: A Rule-based Representation Learner example. The dashed box shows an example of a discrete logical layer and its corresponding rules.
Figure 2: Simplified computation graphs of Gradient Grafting. The left graph is an example of single grafting, while the right graph is an example of hierarchical grafting for the same layers. Solid arrows with solid lines represent the forward pass, while solid arrows with dashed lines represent backpropagation. Each hollow arrow connects a gradient grafting pair. In each pair, the red arrow denotes the grafted gradient, which is a copy of the gradient represented by the blue arrow. Circles represent differentiable functions, while squares represent non-differentiable functions. After grafting, there exists a backward path from the loss function to all the parameters. The linear layer is omitted for clarity. LAF: Logical Activation Function; LO: Logical Operation; q: quantizer (binarization function).
Figure 3: Scatter plot of F1 score against log(#edges) for RRL and baselines on eight datasets (see the model complexity appendix for other datasets). F1 score and log(#edges) are used to evaluate the classification performance and the model complexity, respectively.
Figure 4: Training losses of three compared discrete model training methods and Gradient Grafting with original logical activation functions, i.e., GradGraft (OLAF), or novel logical activation functions, i.e., GradGraft (NLAF), on six data sets. The losses are plotted on a log scale for a better viewing experience, and fluctuations at the bottom are actually much smaller than those at the top.
Figure 5: Training losses of the single and the hierarchical Gradient Grafting on RRL with different depths (i.e., the number of logical layers). The legend labels show the number of nodes in each logical layer. E.g., 512_512_512 represents three logical layers, and each layer has 512 nodes. For a better viewing experience, we plot the losses on a log scale. Hence, fluctuations at the bottom are actually much smaller than those at the top.
...and 7 more figures
