Gatherplot: A Non-Overlapping Scatterplot

Deokgun Park, Sung-Hee Kim, Niklas Elmqvist

TL;DR

The paper addresses overplotting in scatterplots, especially with categorical or duplicated values, by introducing the Gather Transformation that partitions axes into segments and maps data points within those segments to pack marks without aggregation. This yields gatherplots, a 2D gathering representation with layout modes (Absolute, Normalized, Streamgraph) and a local GatherLens interaction for targeted control, all implemented in a D3/Angular prototype. A crowdsourced study demonstrates that gatherplots improve accuracy and user confidence over jittered scatterplots, with mode choices aligned to specific tasks. The approach preserves object identity, supports continuous variables through binning, and offers practical advantages for multidimensional exploration, with future work extending gathering principles to parallel coordinates and additional interactions.

Abstract

Scatterplots are a common tool for exploring multidimensional datasets, especially in the form of scatterplot matrices (SPLOMs). However, scatterplots suffer from overplotting when categorical variables are mapped to one or two axes, or when the same continuous variable is used for both axes. Previous methods such as histograms or violin plots use aggregation, which makes brushing and linking difficult. To address this, we propose gatherplots, an extension of scatterplots to manage the overplotting problem. Gatherplots are a form of unit visualization, which avoids aggregation and maintains the identity of individual objects to ease visual perception. In gatherplots, every visual mark that maps to the same position coalesces to form a packed entity, thereby making it easier to see the overview of data groupings. The size and aspect ratio of marks can also be changed dynamically to make it easier to compare the composition of different groups. In the case of a categorical variable vs. a categorical variable, we propose a heuristic to decide bin sizes for optimal space usage. To validate our work, we conducted a crowdsourced user study that shows that gatherplots enable people to assess data distribution more quickly and more correctly than when using jittered scatterplots.
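
The idea of marks that share a position coalescing into a packed group can be illustrated with a small sketch. This is not the authors' implementation (their prototype is built with D3/Angular); the function name `gather`, the `cell_size` parameter, and the square-grid packing are assumptions chosen only to show how points in the same category cell can be given distinct, non-overlapping positions without aggregation.

```python
from collections import defaultdict
from math import isqrt


def gather(points, x_cats, y_cats, cell_size=1.0):
    """Hypothetical sketch: give each point its own slot inside its (x, y) category cell.

    points  -- list of (x_category, y_category) tuples
    x_cats  -- ordered list of categories on the X axis
    y_cats  -- ordered list of categories on the Y axis
    Returns a dict mapping point index -> (x, y) position.
    """
    if not points:
        return {}

    # Group point indices by the cell they would overplot in.
    groups = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        groups[(x, y)].append(idx)

    # One grid resolution for every cell, sized for the most crowded cell,
    # so that all marks keep the same size (an assumption of this sketch).
    largest = max(len(members) for members in groups.values())
    cols = isqrt(largest - 1) + 1  # ceil(sqrt(largest)) without float error
    step = cell_size / cols

    positions = {}
    for (x, y), members in groups.items():
        x0 = x_cats.index(x) * cell_size
        y0 = y_cats.index(y) * cell_size
        for k, idx in enumerate(members):
            row, col = divmod(k, cols)
            # Place the mark at the center of its grid slot.
            positions[idx] = (x0 + (col + 0.5) * step, y0 + (row + 0.5) * step)
    return positions
```

Every point keeps its own index and position, so operations such as brushing and linking can still address individual objects, which is the property the abstract contrasts with aggregated views.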

Paper Structure (36 sections, 1 equation, 13 figures)

Figures (13)

  • Figure 1: Scatterplot matrix. SPLOM visualizing a car dataset with one continuous variable MPG and one categorical variable Cylinders showing limitations of scatterplots when managing categorical variables. In (a), a scatterplot with the same variable for both axes results in a diagonal line. In (b) and (c), a scatterplot with a continuous vs. a categorical variable results in horizontal or vertical line patterns. In (d), a scatterplot with two categorical variables results in a dot pattern.
  • Figure 2: Cars dataset. Gatherplots showing a dataset related to cars, yielding overplotting in normal scatterplots. The gatherplot in (a) shows Cylinders (categorical) vs. MPG (continuous), highlighting the overall distribution of MPG values of cars with different numbers of cylinders. The brackets on the X-axis indicate that the interval they enclose represents a single value in the data. The gatherplot in (b) shows Cylinders (categorical) vs. Origin (categorical), partitioning the graphical axes into intervals and packing points into groups for each interval. In (c), both the X-axis and the Y-axis show the same continuous variable (MPG). All these cases would have caused overplotting in a scatterplot, resulting in dot-shaped or line-shaped point patterns where individual points cannot be identified.
  • Figure 3: Main layout modes for gatherplots. (a) absolute mode with constant aspect ratio; (b) normalized mode of (a). The rate of male survivors in each passenger class is not easy to compare. Figure (c) shows the streamgraph mode, where each cluster maintains the number of elements along the shorter edge, making it easier to see the distribution of the subgroups along the Y axis.
  • Figure 4: Choosing optimal bin size based on available display space. In (a), there is enough space so that the dot size can be maximized, improving spatial accuracy. In comparison, in (b) the assigned space is small, so the dot size is determined so that the most crowded bin interval will fit within the assigned space. This results in two different overviews even though the two plots have identical aspect ratio.
  • Figure 5: Using gatherplots to manage overplotting. (a) shows a scatterplot of 5,000 random numbers with severe overplotting in the center area. In (b), gathering is applied to create a more organized view. However, gathering shrinks the items so much that outliers become difficult to detect. (c) shows normalized mode, where the outliers are enlarged. This makes identifying the distribution of sparse regions easier.
  • ...and 8 more figures
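
The Figure 4 caption describes the dot size being set so that the most crowded bin still fits in its assigned space, and being maximized when space is plentiful. A minimal sketch of that rule follows. The paper's actual heuristic is not reproduced in this section, so the function name `dot_size`, the `max_size` cap, and the brute-force search over grid shapes are all assumptions made for illustration.

```python
from math import ceil


def dot_size(bin_width, bin_height, counts, max_size):
    """Hypothetical sketch: largest square dot side that lets the most
    crowded bin fit inside a bin_width x bin_height area, capped at max_size.

    counts -- number of points in each bin
    """
    crowded = max(counts, default=0)
    if crowded == 0:
        return max_size  # nothing to pack, so use the largest allowed dot

    best = 0.0
    # Try every possible number of columns for the most crowded bin and
    # keep the grid shape that allows the largest dot.
    for cols in range(1, crowded + 1):
        rows = ceil(crowded / cols)
        size = min(bin_width / cols, bin_height / rows)
        best = max(best, size)

    # With ample space (as in panel (a)), the cap decides the size.
    return min(best, max_size)
```

For example, `dot_size(10, 10, [100], 50)` packs 100 dots into a 10 by 10 area as a 10 by 10 grid, while `dot_size(1000, 1000, [4], 20)` has spare room and simply returns the cap.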