SoftDTW-CUDA-Torch: Memory-Efficient GPU-Accelerated Soft Dynamic Time Warping for PyTorch

Ron Shapira Weber, Oren Freifeld

TL;DR

This work introduces tiled anti-diagonal kernel execution that removes the sequence-length constraint, a log-space backward pass that prevents floating-point overflow, and a fused distance-computation mode that eliminates the $O(BNM)$ intermediate distance tensor, achieving up to 98% memory reduction compared to prior work.

Abstract

We present softdtw-cuda-torch, an open-source PyTorch library for computing Soft Dynamic Time Warping (SoftDTW) on GPUs. Our implementation addresses three key limitations of existing GPU implementations of SoftDTW: a hard sequence-length cap of 1024, numerical instability in the backward pass for small smoothing parameters, and excessive GPU memory consumption from materializing pairwise distance tensors. We introduce (1) tiled anti-diagonal kernel execution that removes the sequence-length constraint, (2) a log-space backward pass that prevents floating-point overflow, and (3) a fused distance-computation mode that eliminates the $O(BNM)$ intermediate distance tensor, achieving up to 98% memory reduction compared to prior work. The library supports arbitrary sequence lengths, full PyTorch autograd integration, and SoftDTW barycenter computation. Code is available at https://github.com/BGU-CS-VIL/sdtw-cuda-torch.
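
To make the three ideas concrete, here is a minimal pure-Python sketch of the standard SoftDTW forward recurrence. It is not the library's CUDA kernels, and the names `softmin`, `sq_dist`, and `soft_dtw` are illustrative rather than part of the softdtw-cuda-torch API. The sketch assumes a squared Euclidean local cost. Cells are visited one anti-diagonal at a time, since cells on the same anti-diagonal do not depend on each other and could therefore be processed in parallel. The soft minimum uses a max-shifted log-sum-exp, the same kind of stabilization that log-space computation relies on. Each local distance is computed at the cell that needs it instead of being read from a precomputed distance tensor.

```python
import math


def softmin(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))).

    Subtracting the largest exponent first keeps every exp() argument <= 0,
    so small gamma values cannot overflow or produce log(0).
    """
    scaled = [-v / gamma for v in values]
    shift = max(scaled)
    if shift == -math.inf:  # every input was +inf
        return math.inf
    total = sum(math.exp(s - shift) for s in scaled)
    return -gamma * (shift + math.log(total))


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors (assumed cost)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def soft_dtw(x, y, gamma=1.0):
    """SoftDTW value between sequences x (n x d) and y (m x d)."""
    n, m = len(x), len(y)
    # R[i][j] is the soft alignment cost of x[:i] and y[:j]; row/column 0 is padding.
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    # Anti-diagonal d holds the cells with i + j == d. Each cell depends only
    # on diagonals d - 1 and d - 2, so cells within one diagonal are independent.
    for d in range(2, n + m + 1):
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            cost = sq_dist(x[i - 1], y[j - 1])  # computed on the fly
            R[i][j] = cost + softmin(
                [R[i - 1][j], R[i][j - 1], R[i - 1][j - 1]], gamma
            )
    return R[n][m]
```

As gamma shrinks, the value approaches the classical DTW cost, which is the regime where an unshifted `exp(-v / gamma)` would underflow. Note that this sketch still stores the full $(N+1) \times (M+1)$ table per pair; it only avoids the separate distance tensor.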

Paper Structure

This paper contains 24 sections, 8 equations, 2 figures, 2 tables, and 1 algorithm.

Figures (2)

  • Figure 1: Benchmark results for batch size $B = 32$. Top row: Peak GPU memory (MB) as a function of sequence length $L$ (left, $D = 128$) and feature dimension $D$ (right, $L = 256$). Bottom row: Wall-clock runtime (ms) for the corresponding configurations. Maghoumi's implementation is unavailable for $L > 1024$ (CUDA thread-block limit) and runs out of memory for large configurations; our unfused and fused modes remain operational throughout. The fused mode trades runtime for significant memory savings.
  • Figure 2: SoftDTW Barycenter on synthetic block-wave data.
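
As a rough aid for reading the memory axis in Figure 1, the snippet below estimates the size of a materialized $B \times N \times M$ pairwise distance tensor. It assumes 4-byte (float32) elements and equal sequence lengths $N = M = L$, and uses the batch size $B = 32$ from the caption. The helper name `distance_tensor_bytes` is hypothetical. The estimate covers only this one tensor, not the peak memory reported in the benchmark.

```python
def distance_tensor_bytes(batch, n, m, bytes_per_element=4):
    """Bytes needed to store a dense batch x n x m distance tensor."""
    return batch * n * m * bytes_per_element


def to_mib(num_bytes):
    """Convert bytes to mebibytes (1 MiB = 2**20 bytes)."""
    return num_bytes / 2**20


if __name__ == "__main__":
    for length in (256, 1024, 4096):
        size = to_mib(distance_tensor_bytes(32, length, length))
        print(f"L = {length}: {size:.0f} MiB")
```

Because the tensor grows with the product $N M$, quadrupling the sequence length multiplies its size by 16, which is why avoiding it matters most for long sequences.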