Hacking a surrogate model approach to XAI

Alexander Wilhelm, Katharina A. Zweig

TL;DR

This article shows that even if the discriminated subgroup - while being otherwise identical in every attribute - never receives a single positive decision from the black box ADM system, the corresponding question of group membership can be pushed down to an arbitrarily deep level of the surrogate tree by the operator of the system.

Abstract

In recent years, the number of new applications for highly complex AI systems has risen significantly. Algorithmic decision-making systems (ADMs) are one such application, in which an AI system replaces the decision-making process of a human expert. As one approach to ensuring the fairness and transparency of such systems, explainable AI (XAI) has become increasingly important. One way to achieve explainability is through surrogate models, i.e., training a new, simpler machine learning model on the input-output relationship of a black box model. The simpler model could, for example, be a decision tree, which is thought to be intuitively understandable by humans. However, there is little insight into how well the surrogate model actually approximates the black box. Our main assumption is that a good surrogate model approach should bring discriminating behavior of the black box to the attention of humans; prior to our research, we assumed that a surrogate decision tree would reveal such a pattern on one of its first levels. However, in this article we show that even if the discriminated subgroup - while being otherwise identical in every attribute - never receives a single positive decision from the black box ADM system, the corresponding question of group membership can be pushed down to an arbitrarily deep level of the tree by the operator of the system. We then generalize this finding to pinpoint the exact level of the tree on which the discriminating question is asked, and we show that in a more realistic scenario, where only a fraction of the disadvantaged group is discriminated against, it is even easier to hide such discrimination. Our approach can be generalized easily to other surrogate models.
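
To make the surrogate idea concrete, the sketch below trains an opaque model, relabels its inputs with the model's own decisions, and fits a small decision tree on that input-output relationship; the data, model choices, and parameters are illustrative assumptions on our part, not the setup used in the paper.

```python
# Minimal sketch of the surrogate-model workflow (illustrative assumptions,
# not the paper's setup): black box -> relabeled data -> simple surrogate.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic input data with four numeric attributes.
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# 1. Train the "black box" (a random forest stands in for any opaque ADM).
black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# 2. Relabel the inputs with the black box's own decisions.
y_bb = black_box.predict(X)

# 3. Fit a simple, human-readable surrogate on the input-output relationship.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_bb)

# 4. Fidelity: how often the surrogate agrees with the black box on new inputs.
X_new = rng.normal(size=(1000, 4))
fidelity = (surrogate.predict(X_new) == black_box.predict(X_new)).mean()
print(f"surrogate fidelity to the black box: {fidelity:.2%}")
```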

Paper Structure

This paper contains 13 sections, 3 theorems, 18 equations, 7 figures, and 1 algorithm.

Key Result

Theorem 1

If the attributes are statistically independent of each other, $p_{i'} < p_{i''} < p_{i'''} < \dots$ determines the order in which the attributes are used in a decision tree from root to leaf. If they are dependent, then for each subset of data points reaching a node, the attribute with $\min p_i$ in that subset is chosen at that node.
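
The excerpt breaks off before the theorem's conclusion, and the paper's exact definition of $p_i$ is not reproduced here. The sketch below therefore only illustrates the standard CART mechanism the theorem builds on: at a node, the attribute whose split yields the lowest weighted Gini impurity is asked first, so attributes appear from root to leaf in order of these values. Data and names are our own illustrative assumptions.

```python
# Illustrative sketch (not the paper's code): rank binary attributes by the
# weighted Gini impurity of splitting on them, as a CART-style tree would.
import numpy as np

def gini(labels: np.ndarray) -> float:
    """Gini impurity of a binary label vector."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / len(labels)
    return 1.0 - np.sum(p ** 2)

def split_impurity(feature: np.ndarray, labels: np.ndarray) -> float:
    """Weighted Gini impurity after splitting on a binary feature."""
    left, right = labels[feature == 0], labels[feature == 1]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 3))                 # three binary attributes
y = (X[:, 2] & (rng.random(500) < 0.9)).astype(int)   # attribute 2 is most informative

order = sorted(range(X.shape[1]), key=lambda i: split_impurity(X[:, i], y))
print("attributes from root downward:", order)        # attribute 2 comes first
```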

Figures (7)

  • Figure 1: A surrogate model is trained on an input data set together with the labels assigned to it by the black box model. It is assumed that the surrogate model thereby approximates the internal logic of the black box model; the surrogate is then used to explain that logic.
  • Figure 2: A sample data set of creatures where only elves that earn more than 10 coins are considered creditworthy.
  • Figure 3: Two possible decision trees that show a decision about creditworthiness based on the species and salary of a creature.
  • Figure 4: Difference of the Gini impurities of the attributes ‘species’ and ‘salary’. Where the difference is positive, the attribute ‘salary’ would be chosen in the root of a decision tree; where the difference is negative, the attribute ‘species’ would be chosen instead (see the sketch after this list).
  • Figure 5: A data set of creatures where all creatures with a salary of 10 coins and elves with a salary of 5 coins are considered creditworthy.
  • ...and 2 more figures
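
To make Figures 2-4 tangible, here is a small reconstruction of the creature data set based only on the captions; the concrete salaries and group sizes are not shown in this excerpt, so the values below are our own assumptions. Only elves earning more than 10 coins are creditworthy, and we inspect which attribute a learned decision tree places at its root.

```python
# Hedged reconstruction of the Figure 2 setup: the label rule comes from the
# caption ("only elves that earn more than 10 coins are considered
# creditworthy"); salaries and group sizes are assumed for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 400
is_elf = rng.integers(0, 2, size=n)              # 1 = elf, 0 = other species
salary = rng.choice([5, 10, 15, 20], size=n)     # coins earned
creditworthy = ((is_elf == 1) & (salary > 10)).astype(int)

X = np.column_stack([is_elf, salary])
tree = DecisionTreeClassifier(random_state=0).fit(X, creditworthy)
# Printing the tree shows which question ('species' or 'salary') sits at the root.
print(export_text(tree, feature_names=["species_is_elf", "salary"]))
```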

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 2
  • Theorem 3