SoftDTW-CUDA-Torch: Memory-Efficient GPU-Accelerated Soft Dynamic Time Warping for PyTorch

Ron Shapira Weber; Oren Freifeld

TL;DR

This work introduces a tiled anti-diagonal kernel execution that removes the sequence-length constraint, a log-space backward pass that prevents floating-point overflow, and a fused distance-computation mode that eliminates the $O(BNM)$ intermediate distance tensor, achieving up to 98% memory reduction compared to prior work.

Abstract

We present softdtw-cuda-torch, an open-source PyTorch library for computing Soft Dynamic Time Warping (SoftDTW) on GPUs. Our implementation addresses three key limitations of existing GPU implementations of SoftDTW: a hard sequence-length cap of 1024, numerical instability in the backward pass for small smoothing parameters, and excessive GPU memory consumption from materializing pairwise distance tensors. We introduce (1) tiled anti-diagonal kernel execution that removes the sequence-length constraint, (2) a log-space backward pass that prevents floating-point overflow, and (3) a fused distance-computation mode that eliminates the $O(BNM)$ intermediate distance tensor, achieving up to 98% memory reduction compared to prior work. The library supports arbitrary sequence lengths, full PyTorch autograd integration, and SoftDTW barycenter computation. Code is available at https://github.com/BGU-CS-VIL/sdtw-cuda-torch.
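To make the ideas in the abstract concrete, the following is a minimal CPU reference sketch of the SoftDTW forward recurrence in plain Python. It is not the library's API or its CUDA kernel. The function names `softmin` and `soft_dtw`, the squared Euclidean cost, and the two-row buffer are assumptions chosen for illustration. It shows two things in miniature: a soft-minimum that shifts by the true minimum before exponentiating, so small smoothing parameters do not cause overflow or `log(0)`, and a pairwise cost computed on the fly inside the loop instead of being read from a precomputed $N \times M$ distance matrix.

```python
import math


def softmin(a, b, c, gamma):
    """Stable soft-minimum: -gamma * log(sum_k exp(-v_k / gamma)).

    Subtracting the true minimum first keeps every exponent <= 0,
    so exp() cannot overflow even for very small gamma.
    """
    m = min(a, b, c)
    if math.isinf(m):
        return m
    s = sum(math.exp(-(v - m) / gamma) for v in (a, b, c))
    return m - gamma * math.log(s)


def soft_dtw(x, y, gamma=1.0):
    """SoftDTW value between two sequences of feature vectors.

    x: list of N vectors, y: list of M vectors (lists of floats).
    The squared Euclidean cost is computed per cell (illustrative
    assumption), and only two rows of the accumulated-cost table
    are kept, so no N x M matrix is stored.
    """
    n, m = len(x), len(y)
    inf = math.inf
    prev = [0.0] + [inf] * m          # row 0: R[0][0] = 0, rest +inf
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)        # R[i][0] = +inf
        for j in range(1, m + 1):
            # Cost computed on the fly instead of read from a table.
            d = sum((a - b) ** 2 for a, b in zip(x[i - 1], y[j - 1]))
            curr[j] = d + softmin(prev[j], prev[j - 1], curr[j - 1], gamma)
        prev = curr
    return prev[m]
```

For example, `soft_dtw([[0.0], [1.0]], [[0.0], [1.0]], gamma=0.01)` stays close to zero, and the result approaches the classical DTW cost as `gamma` shrinks. A GPU implementation would process the cells of each anti-diagonal in parallel, because they depend only on the two preceding anti-diagonals. This sequential sketch does not attempt that.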

Paper Structure (24 sections, 8 equations, 2 figures, 2 tables, 1 algorithm)

Introduction
Background
Soft-DTW formulation.
Backward pass.
Complexity.
Limitations of Prior GPU Implementations
1. Sequence length cap at 1024.
2. Numerical instability in the backward pass.
3. Full distance tensor materialization.
Our Contributions
Tiled Anti-Diagonal Execution
Log-Space Backward Pass
Fused Distance Computation
Fused vs. Unfused Trade-offs.
Barycenter Computation
...and 9 more sections

Figures (2)

Figure 1: Benchmark results for batch size $B = 32$. Top row: Peak GPU memory (MB) as a function of sequence length $L$ (left, $D = 128$) and feature dimension $D$ (right, $L = 256$). Bottom row: Wall-clock runtime (ms) for the corresponding configurations. Maghoumi's implementation is unavailable for $L > 1024$ (CUDA thread-block limit) and runs out of memory for large configurations; our unfused and fused modes remain operational throughout. The fused mode trades runtime for significant memory savings.
Figure 2: SoftDTW Barycenter on synthetic block-wave data.
