Optimizing sDTW for AMD GPUs
Daniel Latta-Lin, Sofia Isadora Padilla Munoz
TL;DR
This work presents an implementation of subsequence Dynamic Time Warping (sDTW) optimized for AMD GPUs and written with ROCm/HIP. It consists of two kernel modules: a normalizer that performs z-score standardization, and a specialized sDTW kernel that processes data per wavefront, uses wavefront-level shuffles (the HIP counterpart of CUDA warp shuffles) for inter-segment communication, and uses two buffers to transfer state across wavefronts. Both kernels operate on float16 data with __half2 packing. The approach builds on prior GPU DTW techniques and adapts them to ROCm, with careful attention to memory access, synchronization, and segment-width tuning. Outputs match CPU references, and for a batch of 512 queries of length 2,000 against a reference of length 100,000 the implementation achieves 9.26e-4 Gsps for sDTW and 4.81 Gsps for normalization. The work demonstrates that DTW optimizations are portable to AMD hardware and outlines quantization and pruning strategies as potential sources of further performance gains. In practice, this enables real-time, large-scale sequence alignment on AMD GPUs for applications such as finance, genomics, and signal processing.
Abstract
Subsequence Dynamic Time Warping (sDTW) is the metric of choice for many sequence-matching and alignment tasks. While sDTW is flexible and accurate, it is neither simple nor fast to compute; significant research effort has been spent devising parallel GPU implementations that leverage efficient memory access and computation patterns, as well as features offered by specific vendors and architectures (notably NVIDIA's). We present an implementation of sDTW on AMD hardware using HIP and ROCm. Our implementation employs well-known parallel patterns, as well as lower-level features offered by ROCm. We use shuffling for intra-wavefront communication and shared memory to transfer data between consecutive wavefronts. With the input constrained to batches of 512 queries of length 2,000, we tuned the number of reference elements processed by a single thread for peak performance.
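For readers unfamiliar with the computation being accelerated, the following is a minimal single-threaded C++ sketch of z-score normalization followed by sDTW. It is illustrative only and is not the GPU kernel described here. The function names `znormalize` and `sdtw`, the squared-difference local cost, the use of the population standard deviation, and float32 arithmetic are all assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Z-score standardization: subtract the mean, divide by the standard deviation.
// Assumption: population standard deviation; constant signals map to all zeros.
std::vector<float> znormalize(const std::vector<float>& x) {
    assert(!x.empty());
    double mean = 0.0;
    for (float v : x) mean += v;
    mean /= static_cast<double>(x.size());

    double var = 0.0;
    for (float v : x) var += (v - mean) * (v - mean);
    double stddev = std::sqrt(var / static_cast<double>(x.size()));
    if (stddev == 0.0) stddev = 1.0;  // avoid division by zero

    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        out[i] = static_cast<float>((x[i] - mean) / stddev);
    return out;
}

// Subsequence DTW: minimum cost of aligning the whole query against any
// contiguous stretch of the reference. Assumption: squared-difference cost.
float sdtw(const std::vector<float>& query, const std::vector<float>& reference) {
    assert(!query.empty() && !reference.empty());
    const float inf = std::numeric_limits<float>::infinity();
    const std::size_t n = reference.size();

    // prev holds the previous row of the cost matrix. The initial row is all
    // zeros, so an alignment may start at any reference position for free.
    std::vector<float> prev(n + 1, 0.0f), curr(n + 1);
    for (std::size_t i = 0; i < query.size(); ++i) {
        curr[0] = inf;  // the query cannot be matched before the reference begins
        for (std::size_t j = 1; j <= n; ++j) {
            const float d = query[i] - reference[j - 1];
            const float best = std::min({prev[j - 1], prev[j], curr[j - 1]});
            curr[j] = d * d + best;
        }
        std::swap(prev, curr);
    }
    // The alignment may also end anywhere: take the minimum of the last row.
    return *std::min_element(prev.begin() + 1, prev.end());
}
```

Each cell depends on its left, upper, and upper-left neighbors, which is the dependency that parallel implementations must respect when splitting the reference across threads. A CPU routine of this kind can also serve as a correctness reference, for example by calling `sdtw(znormalize(query), znormalize(reference))` and comparing against the GPU result within a float16 tolerance.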
