De-cluttering Scatterplots with Integral Images

Hennes Rave, Vladimir Molchanov, Lars Linsen

TL;DR

This work tackles overplotting in scatterplots by introducing a data-driven, global domain deformation that yields a density-equalized, near-uniform distribution of samples while preserving local neighborhood relations. The method builds and iteratively applies a deformation map derived from integral-image representations of a rasterized density, with a corrective term ensuring identity behavior under uniform density. A GPU-accelerated pipeline computes the required integral images and deformation efficiently, enabling interactive visual analysis of large datasets. The authors also explore visual encodings of the deformation (grid, density background, contours) and validate the approach through numerical benchmarks and a user study, demonstrating improved task performance over traditional opacity-based clutter reduction in many scenarios. The technique offers a scalable, deterministic alternative to existing clutter-reduction methods and opens avenues for applications like local lenses and contiguous cartograms.

Abstract

Scatterplots provide a visual representation of bivariate data (or 2D embeddings of multivariate data) that allows for effective analyses of data dependencies, clusters, trends, and outliers. Unfortunately, classical scatterplots suffer from scalability issues, since growing data sizes eventually lead to overplotting and visual clutter on a screen with a fixed resolution, which hinders the data analysis process. We propose an algorithm that compensates for irregular sample distributions by a smooth transformation of the scatterplot's visual domain. Our algorithm evaluates the scatterplot's density distribution to compute a regularization mapping based on integral images of the rasterized density function. The mapping preserves the samples' neighborhood relations. Few regularization iterations suffice to achieve a nearly uniform sample distribution that efficiently uses the available screen space. We further propose approaches to visually convey the transformation that was applied to the scatterplot and compare them in a user study. We present a novel parallel algorithm for fast GPU-based integral-image computation, which allows for integrating our de-cluttering approach into interactive visual data analysis systems.
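The abstract's starting point, an integral image of the rasterized density function, can be illustrated with a small CPU sketch. This is not the authors' GPU algorithm: it is a plain summed-area table, and the helper names (`rasterize`, `integral_image`, `rect_sum`) as well as the assumption that samples lie in the unit square are hypothetical choices made for illustration.

```python
def rasterize(points, width, height):
    """Count samples per pixel; points are (x, y) pairs assumed to lie in [0, 1)."""
    density = [[0] * width for _ in range(height)]
    for x, y in points:
        col = min(int(x * width), width - 1)
        row = min(int(y * height), height - 1)
        density[row][col] += 1
    return density


def integral_image(density):
    """Summed-area table: sat[r][c] = sum of density over rows 0..r and columns 0..c."""
    height, width = len(density), len(density[0])
    sat = [[0] * width for _ in range(height)]
    for r in range(height):
        row_sum = 0
        for c in range(width):
            row_sum += density[r][c]
            sat[r][c] = row_sum + (sat[r - 1][c] if r > 0 else 0)
    return sat


def rect_sum(sat, r0, c0, r1, c1):
    """Density summed over rows r0..r1 and columns c0..c1 (inclusive), in constant time."""
    total = sat[r1][c1]
    if r0 > 0:
        total -= sat[r0 - 1][c1]
    if c0 > 0:
        total -= sat[r1][c0 - 1]
    if r0 > 0 and c0 > 0:
        total += sat[r0 - 1][c0 - 1]
    return total


points = [(0.1, 0.1), (0.1, 0.1), (0.9, 0.9), (0.6, 0.2)]
density = rasterize(points, 4, 4)
sat = integral_image(density)
```

Once the table is built, the sample count inside any axis-aligned rectangle takes at most four lookups, which is what makes integral images attractive for evaluating density over large regions at every pixel.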

Paper Structure

This paper contains 16 sections, 7 equations, 11 figures, and 1 table.

Figures (11)

  • Figure 1: Reproduction from Molchanov and Linsen [Molchanov20_wscg]. Left: The four InIm coefficients computed at location $(x,y)$ stand for integrals of a density function over respective rectangular regions. Right: The four additional coefficients can be computed for the same location as integrals over tilted regions.
  • Figure 2: Efficient computation of InIms. (a)–(c) Computation of column integrals. Pixels highlighted in red iteratively accumulate texture values from the pixels located above them. Green pixels progressively sum up values from the pixel columns below. After $k$ summations are performed in a single rendering pass, every pixel contains upper- and lower-column sums of values stored in two texture channels. (d)–(f) Computation of InIms by iterative accumulation of column integrals. In the last iteration, red and green pixels contain values for $\alpha$ and $\gamma$, respectively. InIms for $\beta$ and $\delta$ can be computed analogously. Note that $\beta$, $\gamma$, and $\delta$ can be evaluated on demand using $\alpha$ only; thus, their explicit computation is not necessary. (g)–(i) Calculation of triangle integrals by summing up column integrals along diagonals. Two of the four required auxiliary integrals are shown. (j)–(l) Tilted InIms can be computed by simple arithmetic operations on precomputed column and triangle integrals. An example for calculating $\alpha_t$ is presented. Tilted InIms $\beta_t$, $\gamma_t$, and $\delta_t$ can be found analogously.
  • Figure 3: (a) Original layout of the MNIST dataset (UMAP, number of neighbors $15$, minimal distance $0.1$) with color-coded classes. (b) Visual encoding of the density-equalizing transform using grid lines after $32$ iterations. The original density of samples is represented by the background texture in (c) and by contour lines in (d). The last two panels allow for analyzing the subcluster structures occluded in (a).
  • Figure 4: Regularization of data sampled roughly along the domain diagonal. A superimposed regular grid conveys the domain deformation. Left: Original scatterplot. Middle: After 2 iterations. Right: After 8 iterations.
  • Figure 5: Regularization of the samples' distribution in a scatterplot. (a) Original scatterplot depicting four clusters shown in blue (400k samples), red (300k samples), green (200k samples), and orange (100k samples). Visual estimation of cluster sizes as well as access to individual samples are hindered by excessive overplotting. (b)–(f) Iterative transformation of scatterplots using the proposed de-cluttering algorithm after 1 (b), 2 (c), 4 (d), 8 (e), and 16 (f) iterations. Computational times are $1.15$ ms, $2.99$ ms, $6.22$ ms, $12.76$ ms, and $25.24$ ms, respectively. After a few iterations, data clusters occupy areas proportional to the number of samples contained in them. No mixing of clusters takes place. A superimposed regular grid is deformed using the same mapping. The shape of the deformed grid represents the computed mapping and may serve for the identification of the original data clusters even if they were not color-coded. (g)–(i) Generalized Scatterplots proposed by Keim et al. [Keim10] demonstrate noticeably less efficient use of the screen space for any combination of the governing parameters: (g) distortion $=1$, overlap $=0.1$, (h) distortion $=0.5$, overlap $=0.05$, (i) distortion $=0$, overlap $=0.1$.
  • ...and 6 more figures
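
The captions of Figures 1 and 2 describe two ideas that a short CPU sketch may help make concrete: column integrals accumulated iteratively from the pixels above, and the remark that $\beta$, $\gamma$, and $\delta$ can be evaluated from $\alpha$ alone. The sketch below rests on several assumptions. The accumulation uses a log-step doubling scan (Hillis-Steele), which is a stand-in and not necessarily the pass structure of the paper. The assignment of $\alpha$, $\beta$, $\gamma$, $\delta$ to the upper-left, upper-right, lower-right, and lower-left regions is a guess, since the captions do not state it. All function names are hypothetical.

```python
def column_prefix_sums(density):
    """Inclusive top-to-bottom column sums via log-step doubling (Hillis-Steele scan)."""
    height = len(density)
    sums = [row[:] for row in density]
    offset = 1
    while offset < height:
        # Every pixel adds the partial sum stored `offset` rows above it.
        # A fresh buffer is built per step, similar to ping-pong textures on a GPU.
        sums = [
            [sums[r][c] + (sums[r - offset][c] if r >= offset else 0)
             for c in range(len(sums[r]))]
            for r in range(height)
        ]
        offset *= 2
    return sums


def alpha_image(density):
    """Accumulate column sums left to right: alpha[r][c] covers rows 0..r, columns 0..c."""
    alpha = []
    for row in column_prefix_sums(density):
        running, out = 0, []
        for value in row:
            running += value
            out.append(running)
        alpha.append(out)
    return alpha


def quadrant_coefficients(alpha, r, c):
    """Split the total mass at pixel (r, c) into four region sums using alpha only.

    Assumed layout (not confirmed by the captions): alpha = upper-left,
    beta = upper-right, gamma = lower-right, delta = lower-left.
    """
    total = alpha[-1][-1]
    a = alpha[r][c]              # rows 0..r,     columns 0..c
    b = alpha[r][-1] - a         # rows 0..r,     columns c+1..end
    d = alpha[-1][c] - a         # rows r+1..end, columns 0..c
    g = total - a - b - d        # rows r+1..end, columns c+1..end
    return a, b, g, d


cols = column_prefix_sums([[1, 0, 2], [0, 3, 0], [4, 0, 5]])
alpha = alpha_image([[1, 2], [3, 4]])
coeffs = quadrant_coefficients(alpha, 0, 0)
```

Because the four regions partition the image, the four values always sum to the total mass, so only $\alpha$ needs to be stored explicitly in this sketch.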