Out-of-Core Dimensionality Reduction for Large Data via Out-of-Sample Extensions

Luca Reichmann; David Hägele; Daniel Weiskopf

TL;DR

The paper tackles the practical challenge of visualizing very large high-dimensional datasets with dimensionality reduction. It introduces a general out-of-sample (OOS) extension framework that constructs a small reference projection and incrementally projects the remainder in batches, enabling out-of-core processing for DR methods including MDS, PCA, t-SNE, UMAP, and autoencoders. Through a comprehensive evaluation using global and local quality metrics across diverse datasets, the authors analyze how the reference-set size and batch processing affect projection quality and runtime, and compare the OOS approach to large-scale DR methods such as PaCMAP and TriMAP. A large-scale use case demonstrates feasibility on one billion instances, highlighting potential for visual analytics and in-situ analysis, while also noting limitations and directions for future work. Overall, the work provides a practical, framework-level solution to scale DR to data that do not fit in memory, with actionable guidance on trading off reference size, runtime, and projection quality.
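The reference-then-batches idea summarized above can be sketched with PCA, one of the DR methods named in the summary. This is only an illustration under stated assumptions, not the authors' framework: the helper names (`fit_reference_pca`, `project_batch`, `batches`) are hypothetical, the data are synthetic, and the generator stands in for reading chunks from disk.

```python
import numpy as np

def fit_reference_pca(reference, n_components=2):
    """Fit PCA on the small reference set only (assumed to fit in memory)."""
    mean = reference.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_components]

def project_batch(batch, mean, components):
    """Out-of-sample step: map new points into the fixed reference projection."""
    return (batch - mean) @ components.T

def batches(n_total, n_features, batch_size, seed=0):
    """Hypothetical stand-in for streaming chunks of a large data set from disk."""
    rng = np.random.default_rng(seed)
    for start in range(0, n_total, batch_size):
        size = min(batch_size, n_total - start)
        yield rng.normal(size=(size, n_features))

# Small reference subset (synthetic here; in practice a sample of the data).
rng = np.random.default_rng(42)
reference = rng.normal(size=(500, 10))
mean, components = fit_reference_pca(reference)

# Only one batch of high-dimensional data is held in memory at a time.
embedding = np.vstack(
    [project_batch(b, mean, components) for b in batches(10_000, 10, 1_000)]
)
print(embedding.shape)
```

The sketch collects the low-dimensional output in memory for simplicity; for truly large data, each projected batch could instead be written to disk or accumulated into a density image.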

Abstract

Dimensionality reduction (DR) is a well-established approach for the visualization of high-dimensional data sets. While DR methods are often applied to typical DR benchmark data sets in the literature, they might suffer from high runtime complexity and memory requirements, making them unsuitable for large data visualization especially in environments outside of high-performance computing. To perform DR on large data sets, we propose the use of out-of-sample extensions. Such extensions allow inserting new data into existing projections, which we leverage to iteratively project data into a reference projection that consists only of a small manageable subset. This process makes it possible to perform DR out-of-core on large data, which would otherwise not be possible due to memory and runtime limitations. For metric multidimensional scaling (MDS), we contribute an implementation with out-of-sample projection capability since typical software libraries do not support it. We provide an evaluation of the projection quality of five common DR algorithms (MDS, PCA, t-SNE, UMAP, and autoencoders) using quality metrics from the literature and analyze the trade-off between the size of the reference set and projection quality. The runtime behavior of the algorithms is also quantified with respect to reference set size, out-of-sample batch size, and dimensionality of the data sets. Furthermore, we compare the out-of-sample approach to other recently introduced DR methods, such as PaCMAP and TriMAP, which claim to handle larger data sets than traditional approaches. To showcase the usefulness of DR on this large scale, we contribute a use case where we analyze ensembles of streamlines amounting to one billion projected instances.
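To make the notion of an out-of-sample extension for metric MDS concrete, the following sketch places new points into a fixed reference embedding by minimizing their stress against the reference points. It is an assumption-laden illustration, not the implementation contributed in the paper: it uses a standard majorization (Guttman-style) update per new point, and the function names (`pairwise_dist`, `stress`, `mds_oos`) and the synthetic planar data are hypothetical.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between rows of a (m, d) and rows of b (n, d)."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def stress(y_new, y_ref, target):
    """Raw stress of the new points with respect to the fixed reference points."""
    return float(np.sum((pairwise_dist(y_new, y_ref) - target) ** 2))

def mds_oos(x_new, x_ref, y_ref, n_iter=200):
    """Embed x_new into the existing embedding y_ref without moving y_ref."""
    target = pairwise_dist(x_new, x_ref)  # high-dimensional distances, shape (m, n)
    # Start each new point at the embedding of its nearest reference point.
    y = y_ref[np.argmin(target, axis=1)].copy()
    for _ in range(n_iter):
        diff = y[:, None, :] - y_ref[None, :, :]
        dist = np.maximum(np.linalg.norm(diff, axis=-1), 1e-12)
        # Majorization update: each step does not increase the stress.
        y = y_ref.mean(axis=0) + np.mean((target / dist)[:, :, None] * diff, axis=1)
    return y

# Synthetic check: 2D points rotated into 5D, so distances are preserved exactly.
rng = np.random.default_rng(0)
basis = np.linalg.qr(rng.normal(size=(5, 2)))[0].T  # (2, 5), orthonormal rows
y_ref = rng.normal(size=(200, 2))                    # stands in for a reference MDS result
x_ref = y_ref @ basis
x_new = rng.normal(size=(10, 2)) @ basis

target = pairwise_dist(x_new, x_ref)
y_init = y_ref[np.argmin(target, axis=1)]
y_new = mds_oos(x_new, x_ref, y_ref)
print(stress(y_init, y_ref, target), stress(y_new, y_ref, target))
```

Because every new point is optimized only against the fixed reference set, batches can be processed independently of one another, which is what makes this kind of extension usable out-of-core.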

Paper Structure (24 sections, 1 equation, 6 figures, 4 tables, 1 algorithm)

Introduction
Related Work
DR for Large Data
Use and Evaluation of OOS Extensions
Methodology
Computational Framework
Evaluation
Out-of-Sample Extensions
Data Sets
Metrics
Global Metrics
Local Metrics
Computational Complexities
Experiments
Setup and Implementation
...and 9 more sections

Figures (6)

Figure 1: Metric values for increasing reference set size for the EMNIST, Covertype, and Flow Cytometry data sets. The x-axis uses a logarithmic scale for the reference set sizes, and the y-axis displays the corresponding metric values.
Figure 2: Heat maps of the projections of the Flow Cytometry data set with t-SNE and UMAP. The color bar label shows the number of data points in the most populated area.
Figure 3: Runtime (in seconds) needed for the projections for varying reference set sizes. The x-axis uses a logarithmic scale for the reference set sizes. The line colors refer to the dimensionality of the projected data set.
Figure 4: The time necessary to project a point with batches of varying size. The time is measured with the three different reference set sizes per technique (see the legend).
Figure 5: Projections of the Tornado and KDD Cup '99 data sets in the first and second row, respectively. The projections in the first column were created with UMAP and a reference set size of 262,144.
...and 1 more figure
