FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms

Atul Shree, Harshith Jupuru

TL;DR

CTC decoding on resource-limited devices is costly in both compute and memory. FLToP CTC introduces frame-level pruning guided by a relative threshold $R$ in two stages: it first restricts expansion to the top-$N$ tokens per frame, then prunes candidates whose scores fall below $R$ times the top score, using a conditional break to keep the implementation simple. Key contributions are dynamic frame-level pruning, a platform-agnostic design, and empirical validation of decoder behavior. LibriSpeech experiments show speedups of up to $10.5\times$ and memory reductions of up to $2.78\times$ while maintaining competitive WER. Overall, FLToP CTC offers a practical, scalable approach to efficient CTC decoding suitable for CPUs, GPUs, and low-resource hardware, enabling real-time ASR across diverse platforms.

Abstract

CTC-based ASR systems face computational and memory bottlenecks in resource-limited environments. Traditional CTC decoders can account for up to 90% of processing time in some systems (e.g., wav2vec2-large on L4 GPUs) and are inefficient because they perform exhaustive token-level operations. This paper introduces Frame Level Token Pruning for Connectionist Temporal Classification (FLToP CTC), a novel decoding algorithm that employs frame-level token pruning guided by a relative threshold probability. By dynamically eliminating low-probability tokens per frame, FLToP CTC reduces compute and memory demands with negligible WER degradation. On LibriSpeech, FLToP CTC achieves a 10.5x runtime speedup and 2.78x memory reduction versus standard CTC decoders. Its simplicity enables seamless integration into CTC decoders across platforms (CPUs, GPUs, etc.). FLToP CTC addresses CTC bottlenecks, offering scalability for resource-limited environments and real-time applications and improving the accessibility and efficiency of speech recognition.
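The per-frame selection step described above (keep the top-$N$ tokens, then drop any whose score is below $R$ times the frame's top score) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `fltop_select_tokens`, its parameter names, the default threshold value, and the assumption that scores are linear-domain probabilities (rather than log-probabilities) are all hypothetical choices made for illustration.

```python
def fltop_select_tokens(frame_probs, top_n=4, rel_threshold=0.01):
    """Return the (token_index, probability) pairs kept for one frame.

    Assumptions (not from the paper): frame_probs holds linear-domain
    probabilities, and rel_threshold plays the role of R.
    """
    # Stage 1: sort tokens by probability (descending) and keep the top-N.
    ranked = sorted(enumerate(frame_probs), key=lambda kv: kv[1], reverse=True)
    ranked = ranked[:top_n]
    if not ranked:
        return []

    # Stage 2: prune tokens scoring below R times the frame's top score.
    cutoff = rel_threshold * ranked[0][1]
    kept = []
    for token_index, prob in ranked:
        # Conditional break: the list is sorted, so once one token falls
        # below the cutoff, every remaining token does too.
        if prob < cutoff:
            break
        kept.append((token_index, prob))
    return kept


# Usage: a confident frame keeps few tokens, so fewer beam expansions follow.
frame = [0.90, 0.05, 0.03, 0.015, 0.005]
kept = fltop_select_tokens(frame, top_n=4, rel_threshold=0.05)
```

Because the candidates are sorted before the threshold test, a single comparison per token with an early exit is enough; in a frame dominated by one token, most of the top-$N$ candidates are skipped without further work.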

Paper Structure

This paper contains 9 sections, 5 figures, and 1 algorithm.

Figures (5)

  • Figure 1: Workflow of FLToP CTC Algorithm
  • Figure 2: Count and average emission score of the token chosen at each rank index (taken from the best beam across all test samples) after sorting tokens by emission score
  • Figure 3: WER and decoding time as the beam size token is varied
  • Figure 4: WER and decoding time as the relative token threshold is varied with the Top4 approach
  • Figure 5: Box plot of the overall number of candidates stored in beam search across all time steps and all test samples