Divisi: Interactive Search and Visualization for Scalable Exploratory Subgroup Analysis

Venkatesh Sivaraman, Zexuan Li, Adam Perer

TL;DR

Divisi introduces an interactive notebook-based system for exploratory subgroup analysis in high-dimensional datasets, underpinned by a fast approximate subgroup discovery algorithm and a Subgroup Map visualization. It reframes subgroup analysis as an exploratory data analysis task, enabling mixed-initiative discovery, evaluation, and curation through configurable ranking functions and interactive rule editing. Empirical results from performance evaluation and a think-aloud user study with 13 data scientists demonstrate Divisi’s ability to surface unexpected patterns, support testing of feature interactions, and help practitioners curate representative subgroups for stakeholders. The work provides a practical workflow, algorithmic innovations (configurable approximation, multiple ranking criteria), and actionable insights for integrating exploratory subgroup analysis into real-world data science practice, with open-source availability for broader adoption.

Abstract

Analyzing data subgroups is a common data science task to build intuition about a dataset and identify areas to improve model performance. However, subgroup analysis is prohibitively difficult in datasets with many features, and existing tools limit unexpected discoveries by relying on user-defined or static subgroups. We propose exploratory subgroup analysis as a set of tasks in which practitioners discover, evaluate, and curate interesting subgroups to build understanding about datasets and models. To support these tasks we introduce Divisi, an interactive notebook-based tool underpinned by a fast approximate subgroup discovery algorithm. Divisi's interface allows data scientists to interactively re-rank and refine subgroups and to visualize their overlap and coverage in the novel Subgroup Map. Through a think-aloud study with 13 practitioners, we find that Divisi can help uncover surprising patterns in data features and their interactions, and that it encourages more thorough exploration of subtypes in complex data.

Paper Structure

This paper contains 32 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: Proposed workflow for exploratory subgroup analysis, adapted from the sense-making framework [pirolli_sensemaking_2005] and informed by three expert interviews.
  • Figure 2: Divisi's subgroup discovery algorithm takes as input a matrix of discrete-valued input features (A) and one or more score functions, in this case a Binary Outcome Rate score over the outcomes in (B). For each sampled row (C), the algorithm first scores each single-feature slice containing that row (D), then iteratively expands the top $k$ slices using other features that match the sampled row (E). In this example, $k = 2$ and the minimum slice size is 2 instances.
  • Figure 3: Average running times and accuracy (recall in top 50 returned results) for different parameter settings of Divisi, compared against a Lattice Search and Frequent Itemset approach. (We were unable to run the Frequent Itemset approach on the Reviews dataset due to excessive memory consumption, so we only report its performance on the Census Income and Airline datasets.) Shaded regions represent one standard deviation over 10 trials.
  • Figure 4: The Configuration sidebar (A) and the Subgroups Table (B) allow users to run the subgroup discovery algorithm and browse the rules it returns. For example, in the Census Income dataset, the first returned subgroup (C) represents people with no capital gains or losses who are married to a civilian spouse. This subgroup comprises 38% of the dataset, and has an error rate of 25.3%, compared to 11.6% in the overall Evaluation Set. By clicking the dropdown next to the marital-status feature (D), we can test alternative values for that feature.
  • Figure 5: Different states of the Subgroup Map on the UCI Census Income dataset [adult_2]: (A) an overview of the dataset with no subgroups selected, (B) intersections between three selected subgroups, and (C) highlighting the points that match a subgroup when hovered in the Subgroups Table. Filled-in bubbles indicate classification errors for the income prediction task; each bubble's size indicates the number of instances it contains.
  • ...and 2 more figures
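
The sampling-based search described in the Figure 2 caption can be illustrated with a minimal sketch. This is not Divisi's implementation or API: the function names (`matching_rows`, `outcome_rate`, `discover_slices`), the `max_features` and `seed` parameters, and the choice to sample rows without replacement are all assumptions made for illustration. The sketch follows the caption's steps: sample a row, score every single-feature slice that contains it, keep the top $k$, and expand those slices with other feature values taken from the same row, discarding slices smaller than the minimum size.

```python
import random


def matching_rows(X, rule):
    """Indices of rows that satisfy every (feature, value) pair in `rule`."""
    return [i for i, row in enumerate(X) if all(row[f] == v for f, v in rule)]


def outcome_rate(outcomes, idx):
    """Fraction of positive binary outcomes among the rows in `idx`."""
    return sum(outcomes[i] for i in idx) / len(idx)


def discover_slices(X, outcomes, n_samples=10, k=2, min_size=2,
                    max_features=3, seed=0):
    """Hypothetical sketch of sampled-row slice search (not Divisi's code).

    X is a list of rows of discrete feature values; outcomes is a list of
    0/1 labels. Returns (rule, score) pairs sorted by descending score,
    where a rule is a frozenset of (feature_index, value) pairs.
    """
    rng = random.Random(seed)
    n_features = len(X[0])
    scores = {}  # rule -> score, or None when the slice is too small
    # Assumption: rows are sampled without replacement.
    sampled = rng.sample(range(len(X)), min(n_samples, len(X)))
    for r in sampled:
        row = X[r]
        # Start from every single-feature slice that contains the sampled row.
        frontier = [frozenset([(f, row[f])]) for f in range(n_features)]
        for depth in range(max_features):
            for rule in frontier:
                if rule not in scores:
                    idx = matching_rows(X, rule)
                    scores[rule] = (outcome_rate(outcomes, idx)
                                    if len(idx) >= min_size else None)
            valid = [rule for rule in frontier if scores[rule] is not None]
            top = sorted(valid, key=lambda rule: scores[rule],
                         reverse=True)[:k]
            if depth == max_features - 1:
                break  # rules have reached the maximum length
            # Expand the top-k slices with other features matching the row.
            frontier = list({rule | {(f, row[f])}
                             for rule in top
                             for f in range(n_features)
                             if (f, row[f]) not in rule})
            if not frontier:
                break
    ranked = [(rule, s) for rule, s in scores.items() if s is not None]
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

Because each candidate slice is built from the values of a sampled row, every slice the search scores is guaranteed to be non-empty, and the amount of work scales with the number of sampled rows and $k$ rather than with the full space of feature combinations. The trade-off, under these assumptions, is that a slice is only found if at least one of its rows is sampled.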