A Scalable Approach to Clustering Embedding Projections

Donghao Ren, Fred Hohman, Dominik Moritz

TL;DR

This work tackles the scalability bottleneck of labeling embedding projection visualizations by clustering a 2D density map, computed with kernel density estimation (KDE) over the projected space, rather than clustering the individual points. The method produces high‑quality polygonal cluster regions from the density map within a few hundred milliseconds, enabling fast labeling and summarization for datasets with millions of points. The approach combines hill‑climbing to form initial regions, a cluster‑neighborhood graph to merge near‑boundary maxima, and density‑based truncation to yield clean boundaries, followed by a post‑processing step that converts the regions into polygons for downstream querying. An open‑source Rust implementation compiled to WebAssembly demonstrates strong runtime performance (roughly 55 ms for density map processing, with total interactive times around 80–100 ms) and clustering quality comparable to existing point‑based methods, supporting practical interactive visualization and SQL‑based labeling workflows.
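
To make the hill‑climbing and truncation steps more concrete, here is a minimal Python sketch; it is not the authors' Rust implementation. The names `find` and `cluster_density_map` are hypothetical, the density map is assumed to be a plain list of rows, and truncation is assumed to use a single global density threshold. The merging of near‑boundary maxima is omitted, and because ascent uses a strict inequality, flat plateaus would split into separate regions.

```python
def find(parent, i):
    """Disjoint-set find with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def cluster_density_map(density, threshold):
    """Return a grid of cluster ids (0, 1, ...) with -1 for truncated cells.

    Assumption: `density` is a non-empty rectangular list of lists, and a
    single global `threshold` stands in for density-based truncation.
    """
    h, w = len(density), len(density[0])
    parent = list(range(h * w))

    # Hill-climbing: join every cell with its steepest uphill 8-neighbor.
    for y in range(h):
        for x in range(w):
            best, best_val = y * w + x, density[y][x]
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and density[ny][nx] > best_val:
                        best, best_val = ny * w + nx, density[ny][nx]
            a, b = find(parent, y * w + x), find(parent, best)
            if a != b:
                parent[a] = b

    # Truncation: keep only cells at or above the threshold and give each
    # surviving region a small consecutive id in scan order.
    ids = {}
    labels = [[-1] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if density[y][x] >= threshold:
                root = find(parent, y * w + x)
                labels[y][x] = ids.setdefault(root, len(ids))
    return labels


example = [
    [1, 2, 1, 0, 1],
    [2, 9, 3, 1, 6],
    [1, 2, 1, 2, 7],
]
labels = cluster_density_map(example, threshold=2)
# labels == [[-1, 0, -1, -1, -1], [0, 0, 0, -1, 1], [-1, 0, -1, 1, 1]]
```

Each cell is visited once and inspects at most eight neighbors, so the work in this sketch scales with the number of grid cells rather than the number of data points, which is the property the summary above relies on.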

Abstract

Interactive visualization of embedding projections is a useful technique for understanding data and evaluating machine learning models. Labeling data within these visualizations is critical for interpretation, as labels provide an overview of the projection and guide user navigation. However, most methods for producing labels require clustering the points, which can be computationally expensive as the number of points grows. In this paper, we describe an efficient clustering approach using kernel density estimation in the projected 2D space instead of points. This algorithm can produce high-quality cluster regions from a 2D density map in a few hundred milliseconds, orders of magnitude faster than current approaches. We contribute the design of the algorithm, benchmarks, and applications that demonstrate the utility of the algorithm, including labeling and summarization.

Paper Structure

This paper contains 19 sections, 3 figures, 1 table, 1 algorithm.

Figures (3)

  • Figure 1: A visual explanation of the algorithm. Starting from a KDE of the (A) projected data, the algorithm first (B) divides the density data into regions using a disjoint set, (C) merges trivial regions with larger regions, and finally (D) truncates the regions by density levels into clusters.
  • Figure 2: A comparison between (A) our algorithm and (B) supercluster, a popular Mapbox library for clustering 2D points. For comparability, we assign a unique cluster id to the points within each cluster region discovered by our algorithm, and use the cluster ids returned by supercluster. We also adjust the bandwidth and zoom level of the two algorithms to produce similarly sized clusters. Since there are more clusters than colors that can be visually differentiated, we use a 10-color palette and ensure that adjacent clusters do not share the same color. Our algorithm takes 84 ms to produce these clusters (including the time to compute the KDE), whereas supercluster takes 913 ms to get the clusters and an additional 247 ms to collect all the points from the clusters.
  • Figure 3: An example combining the clustering algorithm with automatic labeling. Top: We compute clusters for the UltraChat-200k dataset and label each cluster with a class-based TF-IDF method. Bottom: An illustration of one approach to query text data for generating labels. Starting with the cluster boundary polygon, we approximate the polygon with a set of axis-aligned rectangles, and then generate an SQL query with predicates testing for each rectangle. The WHERE clause in this query is then used to compute the TF metric for each word. The entire label generation process can be implemented as SQL queries.
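
The rectangle-based querying described in the Figure 3 caption can be sketched in Python as follows. This is an illustrative assumption rather than the paper's implementation: the helper names `polygon_to_rectangles` and `rectangles_to_where`, the column names `x` and `y`, and the choice of equal-height horizontal strips sampled at their midline are all hypothetical.

```python
def polygon_to_rectangles(polygon, strips):
    """Approximate a simple polygon with axis-aligned rectangles.

    Assumption: the polygon is a list of (x, y) vertices with nonzero height.
    It is cut into `strips` horizontal bands; each band is covered by the
    x-intervals where its midline lies inside the polygon.
    """
    ys = [p[1] for p in polygon]
    y_min, y_max = min(ys), max(ys)
    step = (y_max - y_min) / strips
    rects = []
    for i in range(strips):
        y0, y1 = y_min + i * step, y_min + (i + 1) * step
        mid = (y0 + y1) / 2
        xs = []
        # Collect x-coordinates where polygon edges cross the midline.
        for (ax, ay), (bx, by) in zip(polygon, polygon[1:] + polygon[:1]):
            if (ay <= mid) != (by <= mid):
                xs.append(ax + (mid - ay) * (bx - ax) / (by - ay))
        xs.sort()
        # Consecutive crossing pairs bound the inside of the polygon.
        for x0, x1 in zip(xs[0::2], xs[1::2]):
            rects.append((x0, y0, x1, y1))
    return rects


def rectangles_to_where(rects, x_col="x", y_col="y"):
    """Build a WHERE-clause body with one predicate per rectangle."""
    parts = [
        f"({x_col} BETWEEN {x0} AND {x1} AND {y_col} BETWEEN {y0} AND {y1})"
        for (x0, y0, x1, y1) in rects
    ]
    return " OR ".join(parts) if parts else "FALSE"


triangle = [(0, 0), (4, 0), (0, 4)]
rects = polygon_to_rectangles(triangle, strips=2)
# rects == [(0.0, 0.0, 3.0, 2.0), (0.0, 2.0, 1.0, 4.0)]
where = rectangles_to_where(rects)
```

More strips give a tighter approximation at the cost of a longer predicate. Only numeric values are interpolated here; in a real system the column names should come from trusted configuration rather than user input.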