Piccolo: Large-Scale Graph Processing with Fine-Grained In-Memory Scatter-Gather

Changmin Shin, Jaeyong Song, Hongsun Jang, Dogeun Kim, Jun Sung, Taehee Kwon, Jae Hyung Ju, Frank Liu, Yeonkyu Choi, Jinho Lee

TL;DR

Piccolo tackles the memory-bound challenge of large-scale graph processing by introducing function-in-memory scatter-gather (Piccolo-FIM) inside DRAM and a fine-grained on-chip cache (Piccolo-cache) paired with a collection-extended MSHR. By avoiding arithmetic units in memory and enabling deterministic, row-localized in-DRAM operations, Piccolo leverages DDR’s internal bandwidth while maintaining tiling advantages through cache redesign. Across extensive benchmarks, it delivers a geometric mean speedup of 1.62x (up to 3.28x) and up to 59.7% energy reduction relative to strong baselines, with FPGA emulation validating DDR compatibility. The work demonstrates a practical path to significantly improving memory efficiency for graph workloads on conventional memory platforms, with potential applicability to OLAP and other fine-grained access domains.

Abstract

Graph processing requires irregular, fine-grained random access patterns that are incompatible with contemporary off-chip memory architectures, leading to inefficient data access. This inefficiency makes graph processing an extremely memory-bound application. Because of this, existing graph processing accelerators typically employ a graph tiling-based or processing-in-memory (PIM) approach to relieve the memory bottleneck. In the tiling-based approach, a graph is split into chunks that fit within the on-chip cache to maximize data reuse. In the PIM approach, arithmetic units are placed within memory to perform operations such as reduction or atomic addition. However, both approaches have several limitations, especially when implemented on current memory standards (i.e., DDR). Because the access granularity provided by DDR is much larger than that of the graph vertex property data, much of the bandwidth and cache capacity is wasted. PIM is meant to alleviate such issues, but it is difficult to use in conjunction with the tiling-based approach, resulting in a significant disadvantage. Furthermore, placing arithmetic units inside a memory chip is expensive, so supporting multiple types of operations is thought to be impractical. To address the above limitations, we present Piccolo, an end-to-end efficient graph processing accelerator with fine-grained in-memory random scatter-gather. Instead of placing expensive arithmetic units in off-chip memory, Piccolo focuses on reducing off-chip traffic with a non-arithmetic function-in-memory for random scatter-gather. To fully benefit from in-memory scatter-gather, Piccolo redesigns the cache and MHA of the accelerator so that it benefits from both tiling and in-memory operations. Piccolo achieves a maximum speedup of 3.28$\times$ and a geometric mean speedup of 1.62$\times$ across a wide range of benchmarks.
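The granularity mismatch described in the abstract can be illustrated with a small back-of-the-envelope sketch. It is not taken from the paper: the function name `granularity_waste` is hypothetical, the 8B property size follows the Figure 5 caption, and the 64B access size is an assumption based on a typical DDR burst (a 64-bit channel with burst length 8). The sketch counts how many bytes are fetched versus how many are actually needed when vertex properties are accessed at scattered indices.

```python
def granularity_waste(vertex_ids, prop_bytes=8, access_bytes=64):
    """Return (useful_bytes, fetched_bytes, useful_fraction).

    Assumptions (not from the paper): properties are stored contiguously
    from address 0, and every touched block is fetched exactly once.
    """
    unique = set(vertex_ids)
    if not unique:
        return 0, 0, 0.0
    # Each vertex property lives in exactly one access-sized block.
    blocks = {(v * prop_bytes) // access_bytes for v in unique}
    useful = len(unique) * prop_bytes
    fetched = len(blocks) * access_bytes
    return useful, fetched, useful / fetched


# Scattered accesses: every vertex falls into a different 64B block.
print(granularity_waste([0, 9, 18, 27]))  # (32, 256, 0.125)
# Contiguous accesses: eight 8B properties fill one 64B block.
print(granularity_waste(range(8)))        # (64, 64, 1.0)
```

Under these assumptions, fully scattered accesses use only one eighth of the transferred bytes, which is the kind of waste the abstract attributes to coarse DDR access granularity.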

Paper Structure

This paper contains 32 sections, 20 figures, 2 tables, 1 algorithm.

Figures (20)

  • Figure 1: Piccolo overview compared to existing graph processing accelerators.
  • Figure 2: (a) Random access pattern and (b) tiling method in graph processing.
  • Figure 3: Motivational experiment on the BFS algorithm. Existing accelerators still suffer from unnecessary accesses due to fine-grained random access, even with perfect tiling that yields full cache hits.
  • Figure 4: Piccolo architecture for (a) gather and (b) scatter operations. Shaded boxes depict the newly added modules on top of conventional DRAM.
  • Figure 5: Implementations of a 4MB, eight-way cache for 8B-granularity data access with 48-bit addressing. (a) 8B line cache. (b) Piccolo-cache. While the 8B line cache suffers from significant tag overhead, Piccolo addresses this with Piccolo-cache.
  • ...and 15 more figures
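
The tag overhead mentioned in the Figure 5 caption can be estimated with simple arithmetic. The sketch below is an illustration under stated assumptions, not the paper's own calculation: the function name `tag_overhead` is hypothetical, only tag bits are counted (valid, dirty, and replacement bits are ignored), and the 64B line size used for comparison is an assumed conventional value. The 4MB capacity, eight ways, 8B lines, and 48-bit addresses come from the caption.

```python
def tag_overhead(cache_bytes, ways, line_bytes, addr_bits=48):
    """Return (tag_bits_per_line, total_tag_bytes, tag_bytes / data_bytes).

    Assumes a set-associative cache whose sizes are powers of two.
    """
    lines = cache_bytes // line_bytes
    sets = lines // ways
    # For a power of two n, n.bit_length() - 1 equals log2(n).
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    total_tag_bytes = lines * tag_bits // 8
    return tag_bits, total_tag_bytes, total_tag_bytes / cache_bytes


four_mib = 4 * 2**20
print(tag_overhead(four_mib, ways=8, line_bytes=8))   # (29, 1900544, 0.453125)
print(tag_overhead(four_mib, ways=8, line_bytes=64))  # (29, 237568, 0.056640625)
```

With these assumptions, an 8B line cache spends roughly 45% extra storage on tags, against under 6% for 64B lines, which is consistent with the caption's remark that the 8B line cache suffers from significant tag overhead.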