To Measure What Isn't There -- Visual Exploration of Missingness Structures Using Quality Metrics

Sara Johansson Fernstad, Sarah Alsufyani, Silvia Del Din, Alison Yarnall, Lynn Rochester

TL;DR

This work introduces a set of Quality Metrics (QM) to identify and visually analyze structured missingness in high-dimensional data, focusing on Amount Missing ($Q_{AM}$), Joint Missingness ($Q_{JM_{mag}}$, $Q_{JM_{dir}}$, $Q_{JM_{abs}}$), and Conditional Missingness ($Q_{CM_{DiD}}$, $Q_{CM_H}$). The authors formalize these metrics, apply them to synthetic datasets with controlled missingness, and demonstrate their utility across multiple visualization modalities (heatmaps, parallel coordinates, MissiG glyphs, and Cytoscape networks). A real-world ICICLE walking-monitoring case study showcases how QM-guided visual exploration can reveal missingness patterns tied to data collection procedures and participant groups, while also revealing limitations of conditional-missingness metrics in high-joint-missingness settings. Overall, the paper provides a practical framework for diagnosing data quality issues and guiding visualization-driven missing-data analysis, supplemented by openly available materials for replication and testing.
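The three metric families can be illustrated with a small sketch. It is not the paper's formalization: the definitions of $Q_{AM}$, the $Q_{JM}$ variants, and the $Q_{CM}$ variants are not reproduced on this page, so the functions below are simplified stand-ins with hypothetical names. Amount missing is taken as the fraction of missing entries in a variable, joint missingness as the fraction of rows where two variables are missing together, and conditional missingness as the total variation distance between the distribution of all recorded values in one variable and the distribution of its values in rows where another variable is missing.

```python
from collections import Counter

# Illustrative stand-ins only; these are NOT the paper's metric definitions.
# Missing values are represented by None.

def amount_missing(column):
    """Fraction of entries in a column that are missing."""
    return sum(v is None for v in column) / len(column)

def joint_missing(col_a, col_b):
    """Fraction of rows in which both columns are missing at once."""
    both = sum(a is None and b is None for a, b in zip(col_a, col_b))
    return both / len(col_a)

def distribution(values):
    """Discrete probability distribution over the recorded values."""
    recorded = [v for v in values if v is not None]
    if not recorded:
        return {}
    counts = Counter(recorded)
    return {value: n / len(recorded) for value, n in counts.items()}

def conditional_shift(col_j, col_k):
    """Total variation distance between (a) all recorded values in col_k
    and (b) the recorded values in col_k for rows where col_j is missing.
    Returns 0.0 when (b) is empty, since nothing can be compared."""
    overall = distribution(col_k)
    conditional = distribution(
        [k for j, k in zip(col_j, col_k) if j is None]
    )
    if not conditional:
        return 0.0
    keys = set(overall) | set(conditional)
    return 0.5 * sum(
        abs(overall.get(x, 0.0) - conditional.get(x, 0.0)) for x in keys
    )

# Hypothetical toy data: three variables, six items.
data = {
    "A": [1, None, 3, None, 5, 6],
    "B": [None, None, 2, None, 2, 1],
    "C": ["x", "y", "x", "y", "x", "x"],
}

for name, column in data.items():
    print(name, "amount missing:", amount_missing(column))
print("A and B jointly missing:", joint_missing(data["A"], data["B"]))
print("shift in C when A is missing:", conditional_shift(data["A"], data["C"]))
```

In this toy data, every item missing in A has the value "y" in C, so the conditional distribution differs strongly from the overall one. A large difference of this kind is what a conditional-missingness metric is meant to flag.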

Abstract

This paper contributes a set of quality metrics for identification and visual analysis of structured missingness in high-dimensional data. Missing values are a frequent challenge in most data-generating domains and may cause a range of analysis issues. Structural missingness in data may indicate issues in data collection and pre-processing, but may also highlight important data characteristics. While research into statistical methods for dealing with missing data mainly focuses on replacing missing values with plausible estimated values, visualization has great potential to support a more in-depth understanding of missingness structures in data. Nonetheless, while the interest in missing data visualization has increased in the last decade, it is still a relatively overlooked research topic with a comparatively small number of publications, few of which address scalability issues. Efficient visual analysis approaches are needed to enable exploration of missingness structures in large and high-dimensional data, and to support informed decision-making in the context of potential data quality issues. This paper suggests a set of quality metrics for identifying patterns of interest for understanding structural missingness in data. These quality metrics can be used as guidance in visual analysis, as demonstrated through a use case exploring structural missingness in data from a real-life walking monitoring study. All supplemental materials for this paper are available at https://doi.org/10.25405/data.ncl.c.7741829.

Paper Structure

This paper contains 23 sections, 6 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Discrete probability distributions: a) distribution of all recorded values in $\vec{d_k}$, b) distribution in $\vec{d_k}$ of items that are missing in $\vec{d_j}$, c) the difference between the distributions in a) and b) highlighted in grey.
  • Figure 2: The basic structure of MissiG for three or four variables \cite{fernstad2021explore}. Variable $C$ is selected in \ref{Fig:MissVisC} and \ref{Fig:MissVisD}, with patterns related to missing values in $C$ represented by red in the other variables.
  • Figure 3: Visualization of $BreastCancer_{AM}$. Missing values are represented below the axes in parallel coordinates (PC), and MissiG glyphs are used to display further missingness structures.
  • Figure 4: $BreastCancer_{JM}$ displayed in heatmap and barchart ordered from left to right by $Q_{AM}$.
  • Figure 5: Network visualization of $BreastCancer_{JM}$ with layout and visual appearance based on JM metrics.
  • ...and 9 more figures