Multi-domain performance analysis with scores tailored to user preferences

Sébastien Piérard, Adrien Deliège, Marc Van Droogenbroeck

TL;DR

The paper addresses two sources of uncertainty in model evaluation, the domain distribution and the user's tradeoffs, with a probabilistic framework in which domain-specific performances are averaged by summarization. It shows that, for several families of scores, the score of the summarized performance is a weighted sum of the domain-specific scores, with explicit weights for unconditional and ratio-based scores. Using ranking scores and a model of user preferences, it defines the easiest, most difficult, preponderant, and bottleneck domains, and provides visualization tools (new flavors of the Tile) for two-class crisp classification. The result is a practical, user-centered method for multi-domain performance analysis, with visuals that help prioritize development and improvement efforts.

Abstract

The performance of algorithms, methods, and models tends to depend heavily on the distribution of cases to which they are applied, this distribution being specific to the application domain. After performing an evaluation in several domains, it is highly informative to compute a (weighted) mean performance and, as shown in this paper, to scrutinize what happens during this averaging. To achieve this goal, we adopt a probabilistic framework and consider a performance as a probability measure (e.g., a normalized confusion matrix for a classification task). It turns out that the corresponding weighted mean is known as the summarization, and that only some remarkable scores assign to the summarized performance a value equal to a weighted arithmetic mean of the values assigned to the domain-specific performances. These scores include the family of ranking scores, a continuum parameterized by user preferences, and the weights to consider in the arithmetic mean depend on the user preferences. Based on this, we rigorously define four domains, named easiest, most difficult, preponderant, and bottleneck domains, as functions of user preferences. After establishing the theory in a general setting, regardless of the task, we develop new visual tools for two-class classification.
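The averaging described in the abstract can be checked numerically. The sketch below is a minimal illustration, not the authors' code: the three confusion matrices, the domain weights `lam`, and the matrix layout are hypothetical values chosen for the example. Accuracy stands in for an unconditional score and precision for a ratio-based one; the paper's ranking scores are not reproduced here. For precision, the weights are assumed to be proportional to the domain weight times the probability of a positive prediction in that domain, which follows from writing precision as a ratio of two probabilities.

```python
import numpy as np

# Hypothetical normalized confusion matrices for three domains.
# Layout assumed here: rows = true class, columns = predicted class,
# so P[0, 0] = TN, P[0, 1] = FP, P[1, 0] = FN, P[1, 1] = TP.
domains = np.array([
    [[0.50, 0.10], [0.10, 0.30]],
    [[0.20, 0.30], [0.05, 0.45]],
    [[0.70, 0.02], [0.20, 0.08]],
])
lam = np.array([0.5, 0.3, 0.2])  # hypothetical domain weights, summing to 1


def accuracy(P):
    """Probability of a correct prediction (an unconditional score)."""
    return P[0, 0] + P[1, 1]


def precision(P):
    """Probability of a true positive given a positive prediction (a ratio)."""
    return P[1, 1] / (P[0, 1] + P[1, 1])


# Summarization: the weighted mean of the domain-specific performances.
P_bar = np.tensordot(lam, domains, axes=1)

# Accuracy is linear in P, so the domain weights carry over unchanged.
acc_mean = sum(l * accuracy(P) for l, P in zip(lam, domains))

# Precision is a ratio, so each domain is reweighted by how often
# it produces a positive prediction.
pred_pos = domains[:, 0, 1] + domains[:, 1, 1]
w = lam * pred_pos / np.sum(lam * pred_pos)
prec_mean = sum(wd * precision(P) for wd, P in zip(w, domains))

print(np.isclose(accuracy(P_bar), acc_mean))
print(np.isclose(precision(P_bar), prec_mean))
print(w)  # precision weights differ from lam
```

Both checks should agree up to floating-point error. The point of the example is that the precision weights `w` differ from `lam`: the weights in the arithmetic mean depend on the score, which is the effect the paper studies as a function of user preferences.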

Paper Structure

This paper contains 12 sections, 11 equations, 2 figures.

Figures (2)

  • Figure 1: Tiles showing the summarization on 3 domains: by multiplying each domain-specific Value Tile by the corresponding Summarization Weight Tile and adding the results together, one obtains the Summarized Value Tile, i.e., the Value Tile for $\overline{P}$. This Tile is exactly the same as the Value Tile that would be obtained from $\overline{P}$ after computing it with the summarization equation. However, by scrutinizing what happens during the performance averaging, it becomes clear that the actual weights to consider strongly depend on the user preferences (i.e., the point in the Tile), which is something hidden in the summarization equation.
  • Figure 2: In this work, we propose four new flavors for the Tile, as visual tools to perform multi-domain performance analyses w.r.t. user preferences. These are the Tiles obtained for the example of Figure 1. Notably, this analysis covers the two sources of uncertainty encountered during model development: the domain dimension (the colors on the Tiles), and the user preferences (the position on the Tiles).