The Categorical Data Map: A Multidimensional Scaling-Based Approach

Frederik L. Dennig, Lucas Joos, Patrick Paetzold, Daniela Blumberg, Oliver Deussen, Daniel A. Keim, Maximilian T. Fischer

TL;DR

This work introduces the Categorical Data Map, a similarity-based, MDS-driven projection for high-cardinality categorical data. By defining item distance as the number of varying attributes and enriching the layout with background attribute distributions and four subset glyphs, it enables clustering of similar subsets and intuitive navigation. The authors contribute two graph-based fracturedness measures to rank attributes by their impact on cluster cohesion, and they validate the approach through quantitative comparisons to MCA and qualitative expert studies, showing improved scalability and interpretability for large category combinations. The method is demonstrated on real datasets (e.g., Titanic, Mushroom, Property Sales) and is available via an online demonstrator, offering a practical tool for exploratory analysis of complex categorical data.

Abstract

Categorical data does not have an intrinsic definition of distance or order, and therefore, established visualization techniques for categorical data only allow for a set-based or frequency-based analysis, e.g., through Euler diagrams or Parallel Sets, and do not support a similarity-based analysis. We present a novel dimensionality reduction-based visualization for categorical data, which is based on defining the distance of two data items as the number of varying attributes. Our technique enables users to pre-attentively detect groups of similar data items and observe the properties of the projection, such as attributes strongly influencing the embedding. Our prototype visually encodes data properties in an enhanced scatterplot-like visualization, encoding attributes in the background to show the distribution of categories. In addition, we propose two graph-based measures to quantify the plot's visual quality, which rank attributes according to their contribution to cluster cohesion. To demonstrate the capabilities of our similarity-based approach, we compare it to Euler diagrams and Parallel Sets regarding visual scalability and show its benefits through an expert study with five data scientists analyzing the Titanic and Mushroom datasets with up to 23 attributes and 8124 category combinations. Our results indicate that the Categorical Data Map offers an effective analysis method, especially for large datasets with a high number of category combinations.
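The distance described above, the number of attributes on which two items differ, is the Hamming distance between their attribute tuples. The following is a minimal sketch of how such a distance matrix could feed a 2D projection. It uses classical (Torgerson) MDS as a stand-in, since the abstract does not state which MDS variant the authors use. The function names and the toy records are hypothetical and not taken from the paper.

```python
import numpy as np

def hamming_distance_matrix(items):
    """Pairwise distance = number of attributes on which two items differ."""
    n = len(items)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = sum(a != b for a, b in zip(items[i], items[j]))
            dist[i, j] = dist[j, i] = d
    return dist

def classical_mds(dist, dims=2):
    """Classical (Torgerson) MDS; an assumed stand-in, not the authors' exact method."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-center the squared distances to obtain a Gram matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the eigenvectors with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:dims]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Hypothetical toy records with three categorical attributes each.
items = [
    ("1st", "female", "yes"),
    ("1st", "male", "yes"),
    ("3rd", "male", "no"),
    ("3rd", "female", "no"),
]
dist = hamming_distance_matrix(items)
coords = classical_mds(dist)  # one 2D position per item
```

Items that differ in few attributes end up with small entries in `dist` and are therefore placed close together in `coords`, which is the property the similarity-based layout relies on.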

Paper Structure

This paper contains 18 sections, 6 equations, 8 figures, and 1 table.

Figures (8)

  • Figure 1: Representation of subsets for a dataset with eight attributes. (a) shows the eight attributes in four segments of equal area, while the glyph size encodes the overall subset size. (b) shows a similar glyph, but the subset size is encoded by a bar at the top, so all glyphs have the same size. (c) encodes the attributes in the same way as the square glyph but is circle-shaped. (d) encodes the subset size by an arc that is filled according to the subset size.
  • Figure 2: The fracturedness of attributes varies considerably and can imply an order, here increasing from left to right. The examples are derived from the Titanic dataset [Titanic95]. The edge-based (i.e., $\mathcal{F}_\textrm{edge}$) and component-based (i.e., $\mathcal{F}_\textrm{comp}$) fracturedness values are provided below each attribute.
  • Figure 3: We illustrate edge-based fracturedness with a Delaunay triangulation shown in black, and a Voronoi partitioning with cell borders shown in red. The cells are colored according to the categories of an attribute. $v_1$, $v_2$, and $v_3$ are vertices of the Delaunay triangulation. The edge $\{v_1, v_2\}$ will not contribute to edge-based fracturedness, since it connects cells representing the same category of a given attribute. Edge $\{v_2, v_3\}$ contributes to edge-based fracturedness because it connects cells representing different categories.
  • Figure 4: We describe component-based fracturedness with a Voronoi partitioning with cell borders shown in red. The associated Delaunay triangulation is shown in black. The cells are colored according to the categories of an attribute. $s_1$ to $s_6$ are six components induced by an attribute through the subgraphs associated with a category. Solid lines connect each subgraph, while dashed lines are not part of any subgraph. With six components, $\mathcal{F}_\textrm{comp} = 0.33$ for the attribute (see \ref{eq:fracturedness}).
  • Figure 5: Through user selection, the borders of a second attribute can be added to the foreground of the plot, e.g., Purchaser Currently Living In is shown in the background as the primary attribute, and Location of Purchased Property is shown in the foreground.
  • ...and 3 more figures
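
The counting behind the two fracturedness measures, as described in the captions of Figures 3 and 4, can be sketched as follows. This is a minimal sketch under stated assumptions: the edge list stands in for the edges of a Delaunay triangulation of the projected points (it could, for instance, be derived from the `simplices` of `scipy.spatial.Delaunay`), the function names are hypothetical, and normalizing the edge count by the total number of edges is an assumption. The paper's own normalization (its fracturedness equation) is not reproduced in this section, so the component-based part returns only the raw component count.

```python
def cross_category_edges(edges, categories):
    """Edges whose endpoints carry different categories of the attribute."""
    return [(u, v) for u, v in edges if categories[u] != categories[v]]

def edge_fraction(edges, categories):
    """Share of edges joining different categories (assumed normalization)."""
    return len(cross_category_edges(edges, categories)) / len(edges)

def count_components(edges, categories):
    """Number of connected components when only same-category edges are kept."""
    parent = list(range(len(categories)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        if categories[u] == categories[v]:
            parent[find(u)] = find(v)
    return len({find(i) for i in range(len(categories))})

# Hypothetical example: five points, one attribute with categories "a" and "b".
categories = ["a", "a", "b", "a", "b"]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2), (1, 3)]

cross = cross_category_edges(edges, categories)
fraction = edge_fraction(edges, categories)
components = count_components(edges, categories)
```

In this toy graph, four of the six edges join different categories, and the same-category edges leave three components: points 0, 1, and 3 together, and points 2 and 4 on their own. An attribute whose categories form few, large components would count as less fractured than one whose categories are scattered across the plot.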