Divisi: Interactive Search and Visualization for Scalable Exploratory Subgroup Analysis
Venkatesh Sivaraman, Zexuan Li, Adam Perer
TL;DR
Divisi is an interactive, notebook-based system for exploratory subgroup analysis in high-dimensional datasets, built on a fast approximate subgroup discovery algorithm and a Subgroup Map visualization. It reframes subgroup analysis as an exploratory data analysis task and supports mixed-initiative discovery, evaluation, and curation through configurable ranking functions and interactive rule editing. A performance evaluation and a think-aloud user study with 13 data scientists indicate that Divisi can surface unexpected patterns, support testing of feature interactions, and help practitioners curate representative subgroups for stakeholders. The work contributes a practical workflow, algorithmic advances (configurable approximation, multiple ranking criteria), and insights for integrating exploratory subgroup analysis into data science practice. Divisi is available as open source.
Abstract
Analyzing data subgroups is a common data science task to build intuition about a dataset and identify areas to improve model performance. However, subgroup analysis is prohibitively difficult in datasets with many features, and existing tools limit unexpected discoveries by relying on user-defined or static subgroups. We propose exploratory subgroup analysis as a set of tasks in which practitioners discover, evaluate, and curate interesting subgroups to build understanding about datasets and models. To support these tasks, we introduce Divisi, an interactive notebook-based tool underpinned by a fast approximate subgroup discovery algorithm. Divisi's interface allows data scientists to interactively re-rank and refine subgroups and to visualize their overlap and coverage in the novel Subgroup Map. Through a think-aloud study with 13 practitioners, we find that Divisi can help uncover surprising patterns in data features and their interactions, and that it encourages more thorough exploration of subtypes in complex data.
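To make the notions of subgroup, coverage, overlap, and re-ranking concrete, the following is a minimal sketch in plain Python. It is not Divisi's API or its approximate discovery algorithm: the toy rows, the rule representation (a dict of feature/value conditions), and all function names (`members`, `coverage`, `outcome_rate`, `overlap`, `rank`) are hypothetical, and overlap is assumed here to be the Jaccard index of the two subgroups' member sets.

```python
# Toy dataset (hypothetical): each row has discrete features and a binary outcome.
rows = [
    {"age": "young", "income": "low", "error": 1},
    {"age": "young", "income": "high", "error": 0},
    {"age": "old", "income": "low", "error": 1},
    {"age": "old", "income": "high", "error": 0},
    {"age": "young", "income": "low", "error": 1},
    {"age": "old", "income": "low", "error": 0},
]


def members(rule, rows):
    """Indices of rows that satisfy every feature == value condition in the rule."""
    return {
        i for i, row in enumerate(rows)
        if all(row[feature] == value for feature, value in rule.items())
    }


def coverage(rule, rows):
    """Fraction of the dataset that falls inside the subgroup."""
    return len(members(rule, rows)) / len(rows)


def outcome_rate(rule, rows, outcome="error"):
    """Mean of the outcome column within the subgroup (0.0 if the subgroup is empty)."""
    idx = members(rule, rows)
    return sum(rows[i][outcome] for i in idx) / len(idx) if idx else 0.0


def overlap(rule_a, rule_b, rows):
    """Jaccard index of the two subgroups' member sets (an assumed overlap measure)."""
    a, b = members(rule_a, rows), members(rule_b, rows)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank(rules, rows, weight_rate=1.0, weight_coverage=0.0):
    """Sort rules by a weighted sum of outcome rate and coverage, best first."""
    def score(rule):
        return weight_rate * outcome_rate(rule, rows) + weight_coverage * coverage(rule, rows)
    return sorted(rules, key=score, reverse=True)


young_low = {"age": "young", "income": "low"}
low_income = {"income": "low"}
old = {"age": "old"}
candidates = [old, low_income, young_low]

by_rate = rank(candidates, rows)                                       # favors high error rate
by_size = rank(candidates, rows, weight_rate=0.0, weight_coverage=1.0)  # favors large subgroups
```

Changing the weights reorders the same candidate rules, which is the kind of re-ranking the abstract describes; a real system would search a far larger rule space than this hand-written list.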
