PIMDAL: Mitigating the Memory Bottleneck in Data Analytics using a Real Processing-in-Memory System

Manos Frouzakis, Juan Gómez-Luna, Geraldo F. Oliveira, Mohammad Sadrosadati, Onur Mutlu

TL;DR

This work demonstrates that real-world Processing-in-Memory (PIM) hardware can substantially mitigate memory bottlenecks in data analytics by implementing four core DB operators (selection, aggregation, ordering, and join) on the UPMEM PIM system. The PIMDAL library combines host-side Apache Arrow data management with PIM-core implementations, and uses sorting and hashing primitives to realize efficient operator backends. Across five TPC-H queries, PIMDAL achieves an average $3.9\times$ speedup over a high-end CPU, with mixed results against a GPU due to inter-core communication and data movement costs. Key contributions include a practical PIM-based design that addresses PIM limitations (low arithmetic performance, explicit memory management, and limited cross-core communication), a detailed sorting- and hash-based operator suite, and host-transfer optimizations (scatter/gather, Arrow memory management, asynchronous transfers). The findings suggest PIM is highly promising for memory-bound analytics and can outperform traditional architectures on select workloads, signaling a path toward more data-centric accelerators in practice.

Abstract

Database Management Systems (DBMSs) are crucial for efficient data management and analytics, and are used in several different application domains. Due to the increasing volume of data a DBMS deals with, current processor-centric architectures (e.g., CPUs, GPUs) suffer from data movement bottlenecks when executing key DBMS operations (e.g., selection, aggregation, ordering, and join). This happens mostly due to the limited memory bandwidth between compute and memory resources. Data-centric architectures like Processing-in-Memory (PIM) are a promising alternative for applications bottlenecked by data, placing compute resources close to where data resides. Previous works have evaluated using PIM for data analytics. However, they either do not use real-world architectures or they consider only a subset of the operators used in analytical queries. This work aims to fully evaluate a data-centric approach to data analytics by using the real-world UPMEM PIM system. To this end, we first present the PIM Data Analytics Library (PIMDAL), which implements four major DB operators: selection, aggregation, ordering, and join. Second, we use hardware performance metrics to understand which properties of a PIM system are important for a high-performance implementation. Third, we compare PIMDAL to reference implementations on high-end CPU and GPU systems. Fourth, we use PIMDAL to implement five TPC-H queries to gain insights into analytical queries. We analyze and show how to overcome the three main limitations of the UPMEM system when implementing DB operators: (I) low arithmetic performance, (II) explicit memory management, and (III) limited communication between compute units. Our evaluation shows PIMDAL achieves $3.9\times$ the performance of a high-end CPU, on average across the five TPC-H queries.

Paper Structure

This paper contains 27 sections, 26 figures, 4 tables.

Figures (26)

  • Figure 1: Roofline model of DB operators running on an Intel Xeon Gold 6226R.
  • Figure 2: Quicksort PIM implementation.
  • Figure 3: Parallelization technique for quicksort based on sort partitioning.
  • Figure 4: Mergesort PIM implementation.
  • Figure 5: Multithreaded PIM implementation of hash partitioning steps 1 and 2.
  • ...and 21 more figures