Super Monotonic Alignment Search

Junhyeok Lee, Hyeongju Kim

TL;DR

Monotonic Alignment Search (MAS) is a core bottleneck in self-supervised TTS due to its $O(T \times S)$ dynamic-programming complexity. The authors introduce Super-MAS, a Triton GPU kernel and PyTorch JIT scripts that parallelize MAS along the text length and perform in-place computation on the log-likelihood matrix, eliminating inter-device copies. They demonstrate substantial speedups of at least 19x and up to 72x over the original Cython implementation, translating into practical training-time reductions and enabling scalable training for longer sequences. This work highlights the value of GPU-centric kernel design for alignment tasks in TTS/ASR and points to future gains from kernel fusion and broader application to alignment in speech systems.

Abstract

Monotonic alignment search (MAS), introduced by Glow-TTS, is one of the most popular algorithms in text-to-speech for estimating unknown alignments between text and speech. Because the algorithm searches for the most probable alignment with dynamic programming, caching all possible paths, its time complexity is $O(T \times S)$, where $T$ is the length of the text and $S$ is the length of the speech representation. The authors of Glow-TTS run this algorithm on the CPU and mention that it is difficult to parallelize. We found, however, that MAS can be parallelized along the text-length dimension, and that CPU execution spends an inordinate amount of time on inter-device copies. We therefore implemented a Triton kernel and a PyTorch JIT script that accelerate MAS on the GPU without any inter-device copy. As a result, the Super-MAS Triton kernel is up to 72 times faster than the original Cython implementation in the extreme-length case. The code is available at https://github.com/supertone-inc/super-monotonic-align.
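To make the dynamic program concrete, here is a minimal pure-Python sketch of MAS. It is not the authors' Triton kernel or their JIT script: the function name `mas`, the `NEG_INF` constant, and the separate score table `q` are assumptions made for illustration (the paper instead computes in place on the log-likelihood matrix). The sketch assumes $T \le S$ and a path that starts at the first text token and ends at the last one. The inner loop over the text index `i` reads only the previous speech column, which is why that dimension can be processed in parallel.

```python
NEG_INF = -1e9  # illustrative stand-in for the "maximum negative value"


def mas(log_p):
    """Return, for each speech frame j, the text index aligned to it.

    log_p[i][j] is the log-likelihood of text token i at speech frame j.
    Assumes T <= S, where T = len(log_p) and S = len(log_p[0]).
    """
    T, S = len(log_p), len(log_p[0])
    assert T <= S, "each text token needs at least one speech frame"

    # Forward pass: q[i][j] is the best cumulative score ending at (i, j).
    q = [[NEG_INF] * S for _ in range(T)]
    q[0][0] = log_p[0][0]
    for j in range(1, S):  # sequential over speech frames
        # Every i depends only on column j - 1, so this loop is the part
        # that could run in parallel across the text dimension.
        for i in range(min(j + 1, T)):
            stay = q[i][j - 1]
            move = q[i - 1][j - 1] if i > 0 else NEG_INF
            q[i][j] = max(stay, move) + log_p[i][j]

    # Backtracking: walk from the last frame back to the first.
    align = [0] * S
    i = T - 1
    for j in range(S - 1, -1, -1):
        align[j] = i
        if j > 0 and i > 0 and (i == j or q[i][j - 1] < q[i - 1][j - 1]):
            i -= 1
    return align


if __name__ == "__main__":
    toy = [[0.0, -1.0, -5.0],
           [-5.0, -2.0, 0.0]]
    print(mas(toy))
```

Both passes touch each of the $T \times S$ cells at most once, which matches the stated $O(T \times S)$ cost. Only the loop over speech frames is inherently sequential.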

Paper Structure (7 sections, 2 figures, 1 table, 2 algorithms)

Figures (2)

  • Figure 1: Top Left: Memory hierarchy of Ampere architecture-based GPUs, with bandwidth and memory size [nvidia_a100, nvidia_ampere, flashattn]. Top Right: Cython implementation of MAS, including nested loops and inter-device copy. Bottom: Triton kernel implementation of MAS, without nested loops or inter-device copy. The x-axis represents the speech-length domain, and the y-axis represents the text-length domain. Black blocks denote the maximum negative value. The batch domain is omitted for simplicity, since both implementations process it in parallel.
  • Figure 2: MAS benchmark results in linear scale (left) and log scale (right). The batch size is fixed at 32, and the speech length is set to four times the text length.