Dimensionality Reduction Considered Harmful (Some of the Time)

Hyeon Jeon

TL;DR

This work investigates how the use of dimensionality reduction (DR) in visual analytics can produce unreliable conclusions and proposes concrete remedies. It identifies three core reliability challenges (misuse of t-SNE/UMAP for inappropriate tasks, hyperparameter cherry-picking, and erroneous interactions), all driven by distortion-prone projections and biased evaluation metrics. The contributions include (i) Label-Trustworthiness and Label-Continuity, label-based evaluation metrics that mitigate overemphasis on class separability, (ii) a dataset-adaptive DR optimization workflow that uses structural complexity metrics (Pds and Mnc) to accelerate hyperparameter search and selection, and (iii) distortion-aware brushing, which robustly locates high-dimensional clusters despite projection distortions. Together, these developments aim to make DR-enabled visual analytics more trustworthy, reproducible, and efficient, with practical impact on how practitioners select DR techniques, tune parameters, and interact with projections.

Abstract

Visual analytics now plays a central role in decision-making across diverse disciplines, but it can be unreliable: the knowledge or insights derived from the analysis may not accurately reflect the underlying data. In this dissertation, we improve the reliability of visual analytics with a focus on dimensionality reduction (DR). DR techniques enable visual analysis of high-dimensional data by reducing it to two or three dimensions, but they inherently introduce errors that can compromise the reliability of visual analytics. To this end, we investigate reliability challenges that practitioners face when using DR for visual analytics. We then propose technical solutions to address these challenges, including new evaluation metrics, optimization strategies, and interaction techniques. We conclude the thesis by discussing how our contributions lay the foundation for achieving more reliable visual analytics practices.
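
The claim that DR inherently introduces errors can be illustrated with a toy case. The sketch below is not taken from the dissertation: it assumes a hypothetical "DR technique" (`project_to_xy`, which simply drops one coordinate) and a hand-picked dataset, the four vertices of a regular tetrahedron. All pairs of these points are equally far apart in 3D, and no arrangement of four points in a plane can keep all pairwise distances equal, so some distortion is unavoidable in any 2D projection of them.

```python
import itertools
import math

# Four vertices of a regular tetrahedron: every pair is equally far apart in 3D.
points_3d = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]


def project_to_xy(point):
    """Hypothetical toy 'DR technique': drop the z coordinate."""
    return point[:2]


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    return [math.dist(a, b) for a, b in itertools.combinations(points, 2)]


dists_3d = pairwise_distances(points_3d)
dists_2d = pairwise_distances([project_to_xy(p) for p in points_3d])

# Ratio of projected to original distance per pair; 1.0 means no distortion.
ratios = [d2 / d3 for d2, d3 in zip(dists_2d, dists_3d)]

for d3, d2, r in zip(dists_3d, dists_2d, ratios):
    print(f"3D distance {d3:.3f} -> 2D distance {d2:.3f} (ratio {r:.3f})")
```

In this toy case some pairs keep their original distance while others are compressed, so points that were equally similar in the original space look unequally similar in the projection. Real DR techniques are far more sophisticated than dropping a coordinate, but they face the same constraint.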

Paper Structure

This paper contains 375 sections, 33 equations, 38 figures, 10 tables.

Figures (38)

  • Figure 1: Three different perspectives in defining distortions in DR projections (\ref{sec:distortionsproj}). Each type of distortion is measured with a different set of DR evaluation metrics.
  • Figure 2: $t$-SNE projections (right) of a 2D dataset (left) with different perplexity values. The resulting projections fail to faithfully depict the structure of the original data, and their patterns vary with the hyperparameter value.
  • Figure 3: The illustration of the workflow model. The model explains how an analyst and a machine interact while conducting visual analytics using DR. Each stage of visual analytics executed by analysts and machines is represented by red and blue rectangles, respectively, and the input and output of each stage are designated by arrows.
  • Figure 4: Illustrations of the analytic tasks using DR and their alignment to local and global DR techniques. Our literature review identifies seven types of analytic tasks using DR.
  • Figure 5: The trend of the accumulated number of papers that use (a) or misuse (b) four major DR techniques. We collect papers published from 2008, the year t-SNE was introduced. Note that UMAP's data likewise starts from the year it was released (2018).
  • ...and 33 more figures