Fine-Grained Vectorized Merge Sorting on RISC-V: From Register to Cache

Jin Zhang, Jincheng Zhou, Xiang Zhang, Di Ma, Chunye Gong

TL;DR

This work presents RVMS, a fine-grained vectorized merge sort for RISC-V with RVV, addressing shuffle- and cache-related bottlenecks across register-level and cache-aware phases. It introduces a register-strided transpose to proxy data shuffle, a hybrid merging network to minimize vector shuffle, a half-merge strategy to balance in-place and naïve merges, and an asymmetric multi-way input merging network to boost throughput. Together, these components yield substantial gains, including an overall 36% improvement over a baseline, and speedups of up to 1.85x over std::sort at certain scales, demonstrating the practicality of RVV-aware, cache-conscious sorting. The approach highlights how careful architectural-algorithm co-design can unlock efficient sorting on modern vector-enabled RISC-V cores, with implications for high-performance data processing on similar architectures.

Abstract

Merge sort, a divide-sort-merge paradigm, is widely applied across computer science. As modern reduced instruction set computing architectures such as the fifth generation (RISC-V) treat multiple registers as a vector register group for wide instruction parallelism, optimizing merge sort with this vectorized property is becoming increasingly common. In this paper, we overhaul the divide-sort-merge paradigm, from its register-level sort to the cache-aware merge, to develop a fine-grained RISC-V vectorized merge sort (RVMS). At the register level, RISC-V lacks an inline vectorized transpose instruction, so implementing the transpose efficiently is non-trivial. In addition, vectorized comparisons do not always perform well in merging networks. Both issues stem primarily from the expensive data shuffle instruction. To bypass it, RVMS uses register-strided data movement as a proxy for data shuffle to accelerate the transpose operation, and replaces vectorized comparisons with their scalar counterparts for lighter-weight swaps of the actual values. At the cache level, where cache-aware merge performs larger merges inside the cache, most merge schemes have two drawbacks: the in-cache merge usually has low cache utilization, while the out-of-cache merging network retains an inefficient symmetric structure. To address these, we propose a half-merge scheme that uses the auxiliary space of in-place merge to halve the footprint of naive merge sort, and copies one sequence into this space to avoid the data exchange that in-place merge requires. Furthermore, we develop an asymmetric merging network that adapts to two different input sizes.
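The half-merge idea can be illustrated with a small scalar sketch. This is not the RVMS implementation: it assumes that half-merge resembles the classic merge that copies only one of the two sorted runs into an auxiliary buffer and then merges back into the original array, so the extra space is half of what a naive out-of-place merge needs. The name half_buffer_merge and its parameters are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from the paper: merges the sorted runs
// a[lo, mid) and a[mid, hi) using a buffer that holds only the left run.
template <typename T>
void half_buffer_merge(std::vector<T>& a, std::size_t lo, std::size_t mid,
                       std::size_t hi, std::vector<T>& buf) {
    // Copy only the left run out; the right run stays where it is.
    buf.assign(a.begin() + lo, a.begin() + mid);
    const std::size_t n_left = buf.size();

    std::size_t i = 0;    // next unread element of the copied left run
    std::size_t j = mid;  // next unread element of the right run
    std::size_t k = lo;   // next write position in a

    // Invariant: k = j - (n_left - i), so the write cursor never passes j
    // and no unread element of the right run is overwritten.
    while (i < n_left && j < hi) {
        if (a[j] < buf[i]) {
            a[k++] = a[j++];
        } else {
            a[k++] = buf[i++];  // ties take the left element (stable merge)
        }
    }
    // Flush what remains of the left run. Any leftover right-run elements
    // are already in their final positions, because k == j at that point.
    while (i < n_left) {
        a[k++] = buf[i++];
    }
}
```

Calling half_buffer_merge(a, 0, a.size() / 2, a.size(), buf) on an array whose two halves are each sorted leaves a fully sorted, with buf sized to the left half only. How RVMS vectorizes this step and sizes the buffer against the cache is not specified in the abstract, so those aspects are left out here.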

Paper Structure (16 sections, 10 figures, 6 tables)

This paper contains 16 sections, 10 figures, 6 tables.

Figures (10)

  • Figure 1: The merge sort pipeline and its existing problems: (1) the lack of an economical in-place data shuffle instruction, (2) the use of expensive vectorized comparisons in the odd-even merging network for register-level sort, (3) inefficient utilization of scarce cache resources, and (4) the mismatch between asymmetric inputs and a symmetric merging network structure.
  • Figure 2: The workflow of the register-level sort ($H$ = 4), where each square represents a data item, with darker cells indicating larger values.
  • Figure 3: The proposed merge sort workflow with two core parts, i.e., the register-level sort and the cache-aware merge. The bottom part of this figure presents our improved methods for the four aforementioned problems.
  • Figure 4: Three transpose implementations: transpose_v0 (two shuffle operations), transpose_v1 (memory strided operation), and transpose_v2 (register strided operation). The code for transpose_v0 is overly complex, so we will not display the complete code.
  • Figure 5: The different hybrid strategies in the 16-element bitonic and odd-even merging networks, with blue and black rectangles representing vectorized and serial comparisons, respectively.
  • ...and 5 more figures
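
As a point of reference for the 16-element bitonic merging network named in Figure 5, here is a minimal scalar sketch. It uses only serial compare-exchange steps and does not reproduce the paper's hybrid split between vectorized and serial comparisons. The function names compare_exchange and bitonic_merge16, and the choice of int elements, are assumptions made for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Hypothetical helper: orders a pair so that lo <= hi afterwards.
inline void compare_exchange(int& lo, int& hi) {
    if (hi < lo) {
        std::swap(lo, hi);
    }
}

// Merges two ascending 8-element runs stored in v[0..8) and v[8..16).
inline void bitonic_merge16(std::array<int, 16>& v) {
    // Reversing the second run turns the input into a bitonic sequence
    // (ascending, then descending).
    std::reverse(v.begin() + 8, v.end());

    // Four stages with distances 8, 4, 2, 1. Within a stage, every
    // compare-exchange touches a distinct pair, which is what makes the
    // network suitable for vectorization.
    for (std::size_t d = 8; d >= 1; d /= 2) {
        for (std::size_t i = 0; i < 16; ++i) {
            if ((i & d) == 0) {
                compare_exchange(v[i], v[i + d]);
            }
        }
    }
}
```

Each stage performs 8 independent compare-exchanges, for 32 in total. A hybrid scheme of the kind the caption describes would run some stages as vector operations and others as scalar swaps; which stages go where is not recoverable from the caption alone.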