A Data Aggregation Visualization System supported by Processing-in-Memory

Junyoung Kim, Madhulika Balakumar, Kenneth Ross

TL;DR

DIVAN tackles scalable visualization of large datasets by generating many aggregation-driven heatmaps with a frequency-based binning approach that devotes more resolution to high-density regions. The system is implemented for both general-purpose CPUs and Processing-in-Memory (PIM) architectures, with a novel workload distribution across DPUs that computes all ${\binom{N}{3}}$ 3D aggregates efficiently. Key contributions include the frequency-based binning workflow, approximate binning via histograms to speed up preprocessing, end-to-end execution on CPU and PIM, and an image-based visualization recommendation that prioritizes informative views. Experimental results on taxi and flight data demonstrate substantial end-to-end speedups with PIM (45%-64% for 16+ dimensions) and illustrate how DIVAN reveals both expected and surprising patterns in real-world data. The work enables near-interactive exploration of tens of millions of rows across dozens of dimensions, with practical implications for data analysis and decision-making.

Abstract

Data visualization of aggregation queries is one of the most common ways of doing data exploration and data science, as it can help identify correlations and patterns in the data. We propose DIVAN, a system that automatically normalizes the one-dimensional axes by frequency to generate large numbers of two-dimensional visualizations. DIVAN normalizes the input data via binning to allocate more pixels to data values that appear more frequently in the dataset. DIVAN can utilize either CPUs or Processing-in-Memory (PIM) architectures to quickly calculate aggregates to support the visualizations. On real-world datasets, we show that DIVAN generates visualizations that highlight patterns and correlations, some expected and some unexpected. By using PIM, we can calculate aggregates 45%-64% faster than modern CPUs on large datasets. For use cases with 100 million rows and 32 columns, our system is able to compute 4,960 aggregates (each of size 128x128x128) in about a minute.
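
The following is a minimal Python sketch of the two ideas the abstract describes: binning each axis by frequency, and computing one 3D aggregate per triple of columns. It is not the authors' implementation. It assumes that frequency-based binning can be approximated by equi-depth (quantile) cut points and that the aggregate is a simple COUNT; the helper names `equi_depth_edges`, `to_bin`, and `count_aggregate`, as well as the toy data, are hypothetical.

```python
from bisect import bisect_left
from itertools import combinations
from math import comb


def equi_depth_edges(values, num_bins):
    """Return num_bins - 1 cut points taken at evenly spaced ranks.

    Assumption: equi-depth (quantile) cut points stand in for the
    frequency-based binning described in the abstract, so that dense
    value ranges receive more bins than sparse ones.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // num_bins] for i in range(1, num_bins)]


def to_bin(value, edges):
    """Map a raw value to a bin index in the range [0, len(edges)]."""
    return bisect_left(edges, value)


def count_aggregate(rows, cols, edges_by_col):
    """COUNT(*) grouped by the binned values of three columns.

    The result is a sparse dict keyed by (bin_x, bin_y, bin_z).
    """
    cube = {}
    for row in rows:
        key = tuple(to_bin(row[c], edges_by_col[c]) for c in cols)
        cube[key] = cube.get(key, 0) + 1
    return cube


# Hypothetical toy data: 8 rows, 4 columns, with a skewed first column.
rows = [(v, v % 7, (v * 3) % 11, v % 2) for v in [1, 1, 1, 2, 2, 3, 50, 900]]
num_bins = 4
edges_by_col = [
    equi_depth_edges([r[c] for r in rows], num_bins) for c in range(4)
]

# One 3D aggregate per unordered triple of columns.
cubes = {
    cols: count_aggregate(rows, cols, edges_by_col)
    for cols in combinations(range(4), 3)
}

# Scale quoted in the abstract: 32 columns, 128 bins per axis.
num_aggregates = comb(32, 3)        # 4960 column triples
cells_per_aggregate = 128 ** 3      # 2097152 cells in each aggregate
```

In this sketch the outliers 50 and 900 in the first column fall into the top bins rather than stretching the axis, which is the effect the abstract attributes to allocating more pixels to frequent values. The count `comb(32, 3)` matches the 4,960 aggregates reported for 32 columns.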

Paper Structure

This paper contains 27 sections, 11 figures, 5 algorithms.

Figures (11)

  • Figure 1: Process of creating 2D images from 3D aggregates
  • Figure 2: Execution time for varying number of dimensions (Taxi dataset)
  • Figure 3: Execution time for varying number of dimensions (Flight dataset)
  • Figure 4: Execution time for various numbers of bins
  • Figure 5: Execution time with various numbers of DPUs
  • ...and 6 more figures

Theorems & Definitions (3)

  • Definition 4.1: Shift
  • Definition 4.2: Equality
  • Definition 4.3: Shift overlap