Moonshine: Speech Recognition for Live Transcription and Voice Commands

Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, Pete Warden

TL;DR

Moonshine Tiny demonstrates a 5x reduction in compute requirements for transcribing a 10-second speech segment while incurring no increase in word error rates across standard evaluation datasets, highlighting Moonshine's potential for real-time and resource-constrained applications.

Abstract

This paper introduces Moonshine, a family of speech recognition models optimized for live transcription and voice command processing. Moonshine is based on an encoder-decoder transformer architecture and employs Rotary Position Embedding (RoPE) instead of traditional absolute position embeddings. The model is trained on speech segments of various lengths without zero-padding, which makes the encoder more efficient at inference time. When benchmarked against OpenAI's Whisper tiny.en, Moonshine Tiny demonstrates a 5x reduction in compute requirements for transcribing a 10-second speech segment while incurring no increase in word error rates across standard evaluation datasets. These results highlight Moonshine's potential for real-time and resource-constrained applications.
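To make the RoPE reference concrete, here is a minimal sketch of the general technique in plain Python. It is not Moonshine's implementation: the function names `rope` and `dot`, the base of 10000, and the toy vectors are assumptions chosen for illustration. The sketch shows the property that makes rotary embeddings a natural fit for variable-length input: the attention score between a rotated query and a rotated key depends only on their relative offset, not on their absolute positions.

```python
import math


def rope(vec, pos, base=10000.0):
    """Apply a rotary position embedding to one vector (illustrative sketch).

    Each consecutive pair (vec[i], vec[i + 1]) is rotated by the angle
    pos * base ** (-i / d), where d is the vector dimension and i is even.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an unscaled attention score."""
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors (hypothetical values).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same relative offset (3) at two very different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
```

The two scores agree up to floating-point error, because rotating both vectors by the same extra angle leaves their dot product unchanged.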

Paper Structure

This paper contains 12 sections, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: GFLOPS required by Whisper tiny.en for decoding as a function of input audio duration. OpenAI's Whisper models use a fixed-length encoder that requires any audio less than 30 seconds to be zero-padded; testing with a hypothetical variable-length encoder shows that we can attain significant speed-ups by removing this requirement.
  • Figure 2: Variations of absolute position embeddings usage and the corresponding WER produced by Whisper tiny.en (test.clean split of the Librispeech dataset). Simply adapting Whisper's inference code to avoid encoding of fixed-length audio (as in the center and right-most columns) introduces significant increases in WER, motivating our development of new models with variable-length encoding.
  • Figure 3: Moonshine's model architecture.
  • Figure 4: Distribution of training instance durations after combining open and internally-prepared datasets. A slightly bimodal distribution results from our preprocessing procedure, which assembles successive audio segments into instances between 4 and 30 seconds in length.
  • Figure 5: Left: Word Error Rates (WER) across various ranges of input audio duration. Right: Speed-up in decoding time of Moonshine Base over Whisper base.en across various ranges of input audio duration.
  • ...and 1 more figure
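The zero-padding cost described in the Figure 1 caption can be illustrated with rough arithmetic. The sketch below assumes Whisper's commonly documented front end (16 kHz audio, a 160-sample hop, and a stride-2 convolution, giving 1500 encoder positions for a 30-second window); these constants and the helper names are assumptions, not values taken from this page. It only compares encoder sequence lengths and does not reproduce the paper's 5x figure, which covers the whole model.

```python
SAMPLE_RATE = 16_000   # samples per second (assumed)
HOP = 160              # samples between spectrogram frames (assumed)
CONV_STRIDE = 2        # downsampling before the transformer layers (assumed)
PAD_SECONDS = 30       # fixed window that shorter audio is padded to


def encoder_positions(seconds):
    """Number of encoder positions for a clip of the given duration."""
    frames = int(seconds * SAMPLE_RATE) // HOP
    return frames // CONV_STRIDE


def cost_ratios(seconds):
    """Return (linear, quadratic) cost ratios of padded vs. unpadded encoding.

    Linear terms (projections, feed-forward layers) scale with sequence
    length; self-attention scores scale with its square.
    """
    padded = encoder_positions(PAD_SECONDS)
    actual = encoder_positions(seconds)
    linear = padded / actual
    return linear, linear ** 2


padded_len = encoder_positions(PAD_SECONDS)
actual_len = encoder_positions(10)
linear_ratio, quadratic_ratio = cost_ratios(10)
```

Under these assumptions, a 10-second clip occupies 500 encoder positions instead of 1500, so a variable-length encoder would do about a third of the length-proportional work and about a ninth of the attention-score work.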