Extracting PAC Decision Trees from Black Box Binary Classifiers: The Gender Bias Case Study on BERT-based Language Models

Ana Ozaki, Roberto Confalonieri, Ricardo Guimarães, Anders Imenes

TL;DR

This work introduces PAC-based guarantees for extracting decision-tree surrogates from binary black-box classifiers, addressing fidelity between the surrogate and the target model. It combines theoretical developments (definitions, sample-size bounds, and PAC guarantees) with practical algorithms (TopDown and TrePAC) to produce interpretable trees. The authors validate the approach on a gender-bias case study using BERT/RoBERTa models, showing feasible data requirements and surrogate trees that reveal bias signals, while highlighting model complexity effects. The results demonstrate both the theoretical viability of PAC-guaranteed tree extraction and its practical utility for bias analysis and explainability, with future directions toward multi-class settings and alternative sampling strategies.

Abstract

Decision trees are a popular machine learning method, known for their inherent explainability. In Explainable AI, decision trees can be used as surrogate models for complex black box AI models or as approximations of parts of such models. A key challenge of this approach is determining how accurately the extracted decision tree represents the original model and to what extent it can be trusted as an approximation of that model's behavior. In this work, we investigate the use of the Probably Approximately Correct (PAC) framework to provide a theoretical guarantee of fidelity for decision trees extracted from AI models. Based on theoretical results from the PAC framework, we adapt a decision tree algorithm to ensure a PAC guarantee under certain conditions. We focus on binary classification and conduct experiments where we extract decision trees from BERT-based language models with PAC guarantees. Our results indicate occupational gender bias in these models.
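To make the notion of fidelity concrete, the following minimal sketch estimates how often a surrogate agrees with a black box binary classifier on a sample. It is an illustration under stated assumptions, not the paper's TopDown or TrePAC algorithm: the names `disagreement_count`, `empirical_fidelity`, `toy_black_box`, and `toy_surrogate` are hypothetical, and the toy classifiers over three-bit inputs stand in for a language model and an extracted tree.

```python
import random
from typing import Callable, Sequence, Tuple

# Hypothetical encoding: an example is a tuple of binary features.
Example = Tuple[int, ...]
Classifier = Callable[[Example], int]  # returns a label in {0, 1}


def disagreement_count(surrogate: Classifier, black_box: Classifier,
                       sample: Sequence[Example]) -> int:
    """Number of sample points on which the two classifiers differ."""
    return sum(1 for x in sample if surrogate(x) != black_box(x))


def empirical_fidelity(surrogate: Classifier, black_box: Classifier,
                       sample: Sequence[Example]) -> float:
    """Fraction of sample points on which the surrogate matches the black box."""
    if not sample:
        raise ValueError("sample must be non-empty")
    return 1.0 - disagreement_count(surrogate, black_box, sample) / len(sample)


def toy_black_box(x: Example) -> int:
    """Toy stand-in for the black box: majority vote over three bits."""
    return int(sum(x) >= 2)


def toy_surrogate(x: Example) -> int:
    """Toy stand-in for a one-node tree: looks only at the first bit."""
    return x[0]


# Draw an i.i.d. sample from the uniform distribution (an assumption;
# any fixed distribution over examples could be used instead).
rng = random.Random(0)
sample = [tuple(rng.randint(0, 1) for _ in range(3)) for _ in range(200)]
fidelity = empirical_fidelity(toy_surrogate, toy_black_box, sample)
print(f"empirical fidelity on {len(sample)} examples: {fidelity:.3f}")
```

The quantity computed here is only an empirical estimate on one sample; a PAC guarantee additionally bounds, with high probability, how far the true disagreement under the distribution can be from what is observed.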

Paper Structure

This paper contains 20 sections, 6 theorems, 32 equations, 6 figures, 5 tables, 2 algorithms.

Key Result

Theorem 8

Let $\mathcal{H}$ be a finite hypothesis space from a concept class. Let $\delta,\epsilon \in (0, 1/2)$, and let $m,k\in\mathbb{N}$ be such that $m \geq k/\epsilon$ and […]. Then, for any $t\in\mathcal{H}$ and for any distribution $\mathcal{D}$, with probability at least $1 -\delta$ over the choice of an i.i.d. sample $\mathcal{S}$ of size $m$, we have that […]
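The one condition that is fully visible in the statement above, $m \geq k/\epsilon$, has a simple reading: if at most $k$ of the $m$ sample points are misclassified, the training error $k/m$ is at most $\epsilon$. The sketch below checks only that condition. It does not implement the full sample-size bound, which is cut off in this listing. The function names are hypothetical, and pairing $\epsilon = 0.2$ (from the Figure 1 caption) with $m = 277$, $k = 10$ (from the Figure 3 caption) is an assumption made for illustration.

```python
def meets_visible_condition(m: int, k: int, epsilon: float) -> bool:
    """Check m >= k / epsilon, the only condition readable in the statement."""
    if m <= 0 or k < 0:
        raise ValueError("m must be positive and k non-negative")
    if not 0.0 < epsilon < 0.5:
        raise ValueError("epsilon must lie in (0, 1/2)")
    return m >= k / epsilon


def max_training_error(m: int, k: int) -> float:
    """Largest training error possible with at most k mistakes on m examples."""
    if m <= 0 or k < 0:
        raise ValueError("m must be positive and k non-negative")
    return min(k, m) / m


# Values taken from the figure captions; combining them is an assumption.
m, k, epsilon = 277, 10, 0.2
print(meets_visible_condition(m, k, epsilon))   # 277 >= 10 / 0.2
print(round(max_training_error(m, k), 4))       # 10 / 277
```

With these values the condition holds with room to spare, which matches the Figure 2 caption's remark that the number of misclassified examples stays at most $k$ when an appropriate tree size is chosen.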

Figures (6)

  • Figure 1: TrePAC training error for $k=[0,5,10,15]$ and $n=[3,6,10,18]$. Increasing the number of internal nodes reduces the training error. The horizontal dotted line corresponds to $\epsilon=0.2$. See also Table \ref{tab:sample}.
  • Figure 2: TrePAC misclassified training examples for $k=[0,5,10,15]$ and $n=[3,6,10,18]$. The number of misclassified examples is at most $k$ (horizontal dotted line) when an appropriate tree size is chosen.
  • Figure 3: Surrogate tree for RoBERTa-base. The tree is extracted with $n=3$ and $m=277$ ($k=10$); see Table \ref{tab:sample}.
  • Figure 4: TrePAC true error for $k=[0,5,10,15]$ and $n=[3,6,10,18]$. Increasing the number of nodes decreases the true error.
  • Figure 5: TrePAC total number of misclassified examples for $k=[0,5,10,15]$ and $n=[3,6,10,18]$. Increasing the number of internal nodes of the tree reduces the total number of misclassified examples.
  • ...and 1 more figure

Theorems & Definitions (24)

  • Definition 1
  • Definition 2: Induced Probability Distribution
  • Definition 3: Example Query
  • Definition 4: PAC learnability
  • Definition 5: True Error
  • Definition 6: Training Error
  • Definition 7: Membership Query
  • Theorem 8: Sample Size with Training Error
  • proof
  • Definition 9: Fidelity and Probabilistic Fidelity
  • ...and 14 more