Lens functions for exploring UMAP Projections with Domain Knowledge

Daniel M. Bot, Jan Aerts

TL;DR

The paper addresses the challenge of extracting domain-relevant patterns from UMAP projections by introducing three lens functions that filter the UMAP model's connectivity according to analyst-selected features. The three lens types (global lens, global mask, and local mask) remove edges from the initial model and then recompute the layout, using the initial layout as a stable starting point, to reveal structure aligned with specific questions. Two real-world use cases (breast cancer gene expression and air quality) demonstrate how lensed projections expose relations among genes and temporal and pollutant patterns, and a synthetic benchmark characterises the computational cost. The authors provide an open-source Python package that makes these lenses available for interactive exploration.

Abstract

Dimensionality reduction algorithms are often used to visualise high-dimensional data. Previous studies have used prior information to enhance or suppress expected patterns in projections. In this paper, we adapt such techniques for domain-knowledge-guided interactive exploration. Inspired by Mapper and STAD, we present three types of lens functions for UMAP, a state-of-the-art dimensionality reduction algorithm. Lens functions enable analysts to adapt projections to their questions, revealing otherwise hidden patterns. They filter the modelled connectivity to explore the interaction between manually selected features and the data's structure, creating configurable perspectives, each potentially revealing new insights. The effectiveness of the lens functions is demonstrated in two use cases, and their computational cost is analysed in a synthetic benchmark. Our implementation is available in an open-source Python package: https://github.com/vda-lab/lensed_umap.

Paper Structure

This paper contains 25 sections, 3 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Overview of the three lens types. All three lens types operate on an initial UMAP model, in this case constructed from a dataset with two spatial variables (1). The initial model does not reveal local lens extrema in its connectivity or layout, i.e., observations with low lens values (red) are connected to and located near observations with high lens values (blue). The lens types filter the initial model's edges to separate observations that differ in the lens dimension. In the visualisations, edges that are kept are shown in black and edges that are removed are shown in red. The lens types differ in how they update the initial model: (a) The global lens divides a single lens dimension (shown by horizontally ordered data points) into non-overlapping segments (2) and only keeps the initial model's edges between points in the same or neighbouring segments (3). (b) The global mask constructs a $k_{mask}$-nearest neighbour network over one or more lens dimensions (2) and only keeps the initial model's edges that also exist in the mask network (3). (c) The local mask computes the distance in one or more lens dimensions between points connected in the initial model (2) and only keeps the $k_{mask}$ shortest ones for each point (3). All three lens types compute a layout for their updated model using the initial model's layout as a starting point (4). The resulting embeddings reveal local extrema in the lens dimension.
  • Figure 2: (Lensed) UMAP embeddings for the NKI dataset [vantVeer2002nki]. (a) UMAP embedding (correlation distance, 30 nearest neighbours) coloured by survival. Contrasting patients within the grey dotted ellipse identifies the ESR1 gene (b). (c) A global lens with three segments separates patients by their survival state, indicated by the coloured rectangles. Contrasting patients by survival state within the low ESR1 community identifies the CSTA gene. A local mask (10 neighbours) over CSTA reveals how CSTA varies over the manifold: (d) coloured by survival state, (e) coloured by ESR1, (f) coloured by CSTA. The grey dotted ellipse indicates a low ESR1, high CSTA region with an abundance of 'relapse free' patients.
  • Figure 3: (Lensed) UMAP embeddings for the Air Quality dataset [airdata]. (a)-(c) Default UMAP embedding (cosine distance, 50 nearest neighbours) shown by the model's edges and points coloured by year and features, respectively. (d)-(f) The embedding after applying a global lens over the year dimensions (24 regular segments), drawn as before. (g)-(i) The embedding after applying a local mask (20 neighbours) over the SO$_2$ dimension, drawn as before.
  • Figure 4: (Lensed) UMAP embeddings for the Air Quality dataset [airdata], coloured to summarise the highest feature over the manifold, inspired by [silva2015features, thijssen2023explainingprojections]. (a) Default UMAP (cosine distance, 50 nearest neighbours), (b) a global lens over the year dimensions (24 regular segments), and (c) a local mask (20 neighbours) over the SO$_2$ dimension. Feature values were normalised with a robust z-score, enabling direct comparison of their values. The figures were created using Datashader's categorical shading, which blends hues depending on the category values in each pixel [datashader].
  • Figure 5: Benchmark compute times (µs) excluding the embedding step and mask model computation. A linear regression line with its 95% interval (relating compute time to the initial UMAP model's edge count) is shown for each dataset size (100, 1,000, 10,000, and 100,000 points), lens type, and lens parameter value. (a) The global lens with varied discretisation strategy over 3, 6, 12, and 24 segments. (b) The global mask with 20, 40, 80, and 160 mask neighbours. (c) The local mask with 5, 10, 20, and 40 mask neighbours.
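
The edge-filtering rules described in the Figure 1 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the lensed_umap package's implementation: the function names (`global_lens`, `global_mask`, `local_mask`, `knn_edges`), the toy data, and the representation of the model as a set of undirected, unweighted edges `(i, j)` with `i < j` are all hypothetical. It also assumes equal-width ("regular") segments for the global lens, a brute-force nearest-neighbour search, and that an edge survives a mask when it qualifies for either of its endpoints. Edge weights and the final layout step (4) are omitted.

```python
import math
from collections import defaultdict


def global_lens(edges, lens, n_segments):
    """Keep edges whose endpoints lie in the same or a neighbouring segment.

    Assumption: segments are equal-width bins over the lens range.
    """
    lo, hi = min(lens), max(lens)
    width = (hi - lo) / n_segments or 1.0  # guard against a constant lens
    seg = [min(int((v - lo) / width), n_segments - 1) for v in lens]
    return {(i, j) for i, j in edges if abs(seg[i] - seg[j]) <= 1}


def knn_edges(points, k):
    """Brute-force k-nearest-neighbour network as undirected edges (i < j)."""
    result = set()
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        nearest = sorted(others, key=lambda j: math.dist(p, points[j]))[:k]
        result.update((min(i, j), max(i, j)) for j in nearest)
    return result


def global_mask(edges, lens_points, k_mask):
    """Keep initial edges that also exist in a kNN network over the lens."""
    return edges & knn_edges(lens_points, k_mask)


def local_mask(edges, lens_points, k_mask):
    """Per point, keep the k_mask initial edges with the shortest lens distance."""
    neighbours = defaultdict(list)
    for i, j in edges:
        d = math.dist(lens_points[i], lens_points[j])
        neighbours[i].append((d, j))
        neighbours[j].append((d, i))
    kept = set()
    for i, nbrs in neighbours.items():
        for _, j in sorted(nbrs)[:k_mask]:
            kept.add((min(i, j), max(i, j)))
    return kept


# Toy model: six observations, one lens dimension, seven initial edges.
lens = [0.0, 0.1, 0.45, 0.9, 1.0, 0.07]
lens_points = [(v,) for v in lens]  # the masks accept one or more dimensions
initial_edges = {(0, 1), (1, 2), (2, 3), (3, 4), (0, 4), (1, 5), (4, 5)}

lensed = global_lens(initial_edges, lens, n_segments=3)
masked = global_mask(initial_edges, lens_points, k_mask=2)
local = local_mask(initial_edges, lens_points, k_mask=1)

print(sorted(initial_edges - lensed))  # edges removed by the global lens
print(sorted(initial_edges - masked))  # edges removed by the global mask
print(sorted(initial_edges - local))   # edges removed by the local mask
```

In this toy model, the edges `(0, 4)` and `(4, 5)` join observations with low and high lens values, and all three filters remove them. The local mask with a single neighbour additionally drops `(2, 3)`, because both endpoints have a closer neighbour in the lens dimension. In a full pipeline the filtered model would then be re-embedded starting from the initial layout, as step (4) of the caption describes.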