FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms
Atul Shree, Harshith Jupuru
TL;DR
CTC decoding on resource-limited devices suffers from heavy compute and memory demands. FLToP CTC introduces frame-level pruning guided by a relative threshold $R$. It works in two stages: each frame is first expanded with its top-$N$ tokens, and candidates whose scores fall below $R$ times the top score are then pruned, with a conditional break keeping the implementation simple. Key contributions include dynamic frame-level pruning, a platform-agnostic design, and empirical validation of decoder behavior. LibriSpeech experiments show speedups of up to $10.5\times$ and memory reductions of up to $2.78\times$ while maintaining competitive WER. Overall, FLToP CTC offers a practical, scalable approach to efficient CTC decoding on CPUs, GPUs, and low-resource hardware, enabling real-time ASR across diverse platforms.
Abstract
CTC-based ASR systems face computational and memory bottlenecks in resource-limited environments. Traditional CTC decoders, which can account for up to 90% of processing time in some systems (e.g., wav2vec2-large on L4 GPUs), are inefficient because they perform exhaustive token-level operations. This paper introduces Frame Level Token Pruning for Connectionist Temporal Classification (FLToP CTC), a novel decoding algorithm that employs frame-level token pruning guided by a relative threshold probability. By dynamically eliminating low-probability tokens in each frame, FLToP CTC reduces compute and memory demands with negligible WER degradation. On LibriSpeech, FLToP CTC achieves a 10.5x runtime speedup and a 2.78x memory reduction over standard CTC decoders. Its simplicity enables seamless integration into CTC decoders across platforms (CPUs, GPUs, etc.). By addressing these bottlenecks, FLToP CTC scales to resource-limited environments and real-time applications, making speech recognition more accessible and efficient.
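The per-frame pruning rule summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `prune_frame_tokens`, its parameter names, and the choice to work with plain probabilities rather than log-probabilities are assumptions made for illustration. The sketch takes the top-$N$ tokens of one frame in descending order and stops at the first token whose probability falls below $R$ times the top probability, which is one plausible reading of the conditional break.

```python
import heapq


def prune_frame_tokens(frame_probs, top_n, rel_threshold):
    """Return the (token_id, prob) pairs kept for a single frame.

    Hypothetical helper: names and the probability-domain comparison
    are illustrative assumptions, not taken from the paper.
    """
    # Stage 1: select the top-N tokens, sorted by descending probability.
    candidates = heapq.nlargest(
        top_n, enumerate(frame_probs), key=lambda pair: pair[1]
    )
    if not candidates:
        return []

    # Stage 2: drop tokens scoring below R times the top score.
    cutoff = rel_threshold * candidates[0][1]
    kept = []
    for token_id, prob in candidates:
        if prob < cutoff:
            # Candidates are sorted, so every later token also fails.
            break
        kept.append((token_id, prob))
    return kept


# Example frame over a 5-token vocabulary (made-up probabilities).
frame = [0.70, 0.20, 0.05, 0.03, 0.02]
kept = prune_frame_tokens(frame, top_n=4, rel_threshold=0.1)
print(kept)
```

In this example the cutoff is roughly 0.07, so only the two most probable tokens survive and the decoder would expand two hypotheses for this frame instead of four. A decoder operating on log-probabilities would presumably compare against the top log-score plus $\log R$ instead.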
