Set Visualizations for Comparing and Evaluating Machine Learning Models

Liudas Panavas, Tarik Crnovrsanin, Racquel Fygenson, Eamon Conway, Derek Millard, Norbou Buchler, Cody Dunne

TL;DR

The paper tackles the challenge of comparing multiple predictors by transforming their outputs into sets and visualizing their intersections, enabling direct model-to-model comparisons. It formalizes a four-criteria method for creating set-type data and introduces SetMLVis, an UpSet-style interactive tool tailored for object-detection evaluation. Through a mixed-methods study against a traditional visualization baseline, the authors show that set visualizations improve task accuracy and reduce cognitive workload, especially on complex analyses. The work contributes a general methodology for set-based model comparison, an open-source tool integrated with Jupyter notebooks, and actionable insights for practitioners seeking more interpretable, scalable evaluation workflows.

Abstract

Machine learning practitioners often need to compare multiple models to select the best one for their application. However, current methods of comparing models fall short because they rely on aggregate metrics that can be difficult to interpret or do not provide enough information to understand the differences between models. To better support the comparison of models, we propose set visualizations of model outputs to enable easier model-to-model comparison. We outline the requirements for using sets to compare machine learning models and demonstrate how this approach can be applied to various machine learning tasks. We also introduce SetMLVis, an interactive system that utilizes set visualizations to compare object detection models. Our evaluation shows that SetMLVis outperforms traditional visualization techniques in terms of task completion and reduces cognitive workload for users. Supplemental materials can be found at https://osf.io/afksu/?view_only=bb7f259426ad425f81d0518a38c597be.

Paper Structure

This paper contains 32 sections, 13 figures, 4 tables.

Figures (13)

  • Figure 1: When comparing object detection models, practitioners look at bounding box overlays on images (panel (a)). These visualizations can quickly become convoluted and make extracting meaningful comparisons difficult. By matching predictions from the different models and placing them into sets (panel (b)), we can create visualizations that make it easier to compare the models. Using an UpSet-style visualization (panel (c)), the predictions can be analyzed through meaningful subsets of the data. For instance, examining the predictions in the two bars shows that Models Yellow and Pink excel at detecting cars, while Model Green is adept at identifying vans (panel (d)). This innovative use of set visualizations provides clear, actionable insights when comparing model performance.
  • Figure 2: While all three models have the same average precision of 0.5, this metric hides key differences in their predictions. A closer look at the instance-level data shows that Model Pink consistently detects car fronts, whereas Models Yellow and Green focus on car backs. Relying on aggregate metrics alone misses these important distinctions, which become apparent only through detailed analysis of model outputs.
  • Figure 3: On the top, we see the results of a benchmark study on object detection models where each row represents a model and the columns are precision for a specific class [ding2021object]. While these types of tables are popular, it can be difficult to grasp the differences between models. On the bottom, the challenges from viewing instance-level predictions are apparent. Predictions for cars from Faster R-CNN [ren2015faster], ResNet [he2016deep], and DETR [carion2020end] are shown. The dense overlays make it difficult to compare and understand the performance differences between these models.
  • Figure 4: Illustration of set creation from model predictions. Top: The image is classified as a dog by both models, forming a set of model agreement. Bottom: Discrepant predictions from Model A (mammal) and Model B (reptile) lead to the formation of distinct sets for each model's unique prediction.
  • Figure 5: Example of using an UpSet visualization to compare model predictions. (1) All three models make different predictions for the same image. (2) Models A and B predict a dog, while Model C predicts a crocodile. (3) Set of predictions where all models agree, indicating a dog. (4) Selecting this set displays all images where all three models agree on the prediction of a dog, showing that the models consistently identify a dog when it is not submerged in water.
  • ...and 8 more figures
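
The set-creation idea in the captions for Figures 4 and 5 can be illustrated with a minimal Python sketch for the classification case. This is not the SetMLVis implementation: the function names, the input layout (`{model: {image_id: label}}`), and the toy predictions are all assumptions made for illustration. Each distinct (image, label) prediction becomes an element, and the models that produced it form its set membership. Counting elements per exact membership gives the exclusive intersection sizes that an UpSet-style bar chart would display.

```python
from collections import Counter, defaultdict


def build_prediction_sets(predictions):
    """Map each (image_id, label) prediction to the set of models that made it.

    `predictions` is a hypothetical layout: {model_name: {image_id: label}}.
    """
    membership = defaultdict(set)
    for model, per_image in predictions.items():
        for image_id, label in per_image.items():
            membership[(image_id, label)].add(model)
    # Freeze the sets so they can be used as dictionary keys later.
    return {key: frozenset(models) for key, models in membership.items()}


def exclusive_intersection_sizes(membership):
    """Count how many predictions belong to exactly each combination of models."""
    return Counter(membership.values())


# Toy data loosely following the Figure 4 and Figure 5 captions (assumed values).
predictions = {
    "A": {"img1": "dog", "img2": "dog", "img3": "mammal"},
    "B": {"img1": "dog", "img2": "dog", "img3": "reptile"},
    "C": {"img1": "dog", "img2": "crocodile", "img3": "bird"},
}

membership = build_prediction_sets(predictions)
sizes = exclusive_intersection_sizes(membership)

# Print one row per model combination, largest combinations first.
for models, count in sorted(sizes.items(), key=lambda kv: (-len(kv[0]), sorted(kv[0]))):
    print(sorted(models), count)
```

Selecting a combination, as in step (4) of Figure 5, then amounts to filtering `membership` for the entries whose value equals the chosen set of models. Object detection would additionally need a step that matches bounding boxes across models before membership can be assigned, which this sketch does not cover.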