Performance is not enough: the story told by a Rashomon quartet

Przemyslaw Biecek; Hubert Baniecki; Mateusz Krzyzinski; Dianne Cook

TL;DR

This paper introduces a Rashomon Quartet: a set of four models built on a synthetic dataset that have practically identical predictive performance, yet visual exploration reveals that they give distinct explanations of the relations in the data.

Abstract

The usual goal of supervised learning is to find the best model, the one that optimizes a particular performance measure. However, what if the explanation provided by this model is completely different from that of a second model, and different again from that of a third, despite all having similarly good fit statistics? Is it possible that equally effective models put the spotlight on different relationships in the data? Inspired by Anscombe's quartet, this paper introduces a Rashomon Quartet, i.e. a set of four models built on a synthetic dataset which have practically identical predictive performance. However, visual exploration reveals distinct explanations of the relations in the data. This illustrative example aims to encourage the use of model visualization methods to compare predictive models beyond their performance.

Paper Structure

This paper contains 14 sections, 9 equations, 7 figures, 1 table.

Supplementary materials
Acknowledgements
Appendix A: Rashomon Couple
Appendix B: Analysis of model residuals
Appendix C: How to engineer your own Rashomon quartet
Appendix D: Reproducibility of the results
Appendix E: Variable distributions

Figures (7)

Figure 1: Illustration of the Rashomon effect: equally effective models, each telling a different story. Dashed contours indicate an equal value of a loss function calculated on a validation dataset. Squares indicate "best in class" models trained on the same data. The Rashomon Quartet is designed so that the best models attain equal loss on the validation data, but each model from the set describes a different perspective.
Figure 2: Overview of the Rashomon Quartet: four models of different types -- linear model, decision tree, neural network, and random forest -- with the same predictive performance ($R^2=0.729$, $\mathrm{RMSE}=0.354$) but different behaviors. Each panel shows partial dependence profiles for the three variables $x_1$, $x_2$, and $x_3$. All models agree that $x_1$ is strongly linked with $y$ but disagree on whether the relation is linear. The models disagree on how variables $x_2$ and $x_3$ are related to $y$.
Figure 3: Partial dependence profiles with point-wise 95% confidence intervals shown by line thickness. There is very little overlap between the intervals, which suggests that the models are confidently different in their view of the relationship with $y$.
Figure 4: Example of a Rashomon couple. The black curve corresponds to the true data generating function $f(x) = \operatorname{sign}(x)\,\lvert x \rvert^{\frac{\sqrt{3}-1}{2}}$, the blue curve corresponds to the best model under the OLS criterion in the family $\mathcal{F}_{LM}(b_1)$ of linear functions, and the red curve to the best model in the family $\mathcal{F}_{BT}(b_0)$ of shallow binary trees.
Figure 5: The parallel coordinate plot depicts ranges for residuals for different models, one range per observation ordered along the mean value. The second panel shows the difference between model averages and standard deviations for residuals, one point per observation. The following panels show the dendrogram and PCA for residuals.
...and 2 more figures
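
The Rashomon couple in the Figure 4 caption can be checked numerically. The sketch below is only an illustration under assumptions that the caption does not state: $x$ is taken to be uniform on $[-1, 1]$, the linear family is read as $y = b_1 x$, and the shallow tree is read as a single split at zero with leaf values $\pm b_0$. The function name `rashomon_couple` and the grid size `n` are hypothetical.

```python
import math

# Exponent from the Figure 4 caption: (sqrt(3) - 1) / 2
P = (math.sqrt(3) - 1) / 2


def f(x):
    """True data generating function: sign(x) * |x|**P."""
    return math.copysign(abs(x) ** P, x)


def rashomon_couple(n=100_000):
    """Return (b1, b0, mse_lm, mse_bt) for the two OLS-optimal models.

    Assumption (not stated in the caption): x is uniform on [-1, 1],
    approximated here by a midpoint grid of n points.
    """
    xs = [-1 + (2 * i + 1) / n for i in range(n)]
    ys = [f(x) for x in xs]

    # Linear family y = b1 * x: closed-form OLS slope.
    b1 = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

    # Tree family y = b0 * sign(x) (one split at 0, an assumed form):
    # the OLS leaf value is the mean of |y|.
    b0 = sum(abs(y) for y in ys) / n

    mse_lm = sum((y - b1 * x) ** 2 for x, y in zip(xs, ys)) / n
    mse_bt = sum((y - math.copysign(b0, x)) ** 2 for x, y in zip(xs, ys)) / n
    return b1, b0, mse_lm, mse_bt


if __name__ == "__main__":
    b1, b0, mse_lm, mse_bt = rashomon_couple()
    print(f"b1={b1:.4f} b0={b0:.4f} mse_lm={mse_lm:.5f} mse_bt={mse_bt:.5f}")
```

Under these assumptions the two mean squared errors agree up to the discretization error of the grid, even though one model is a straight line and the other a step function. With a different distribution for $x$ or a different parametrization of the two families, the equality need not hold.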
