Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST

Monica Sekoyan, Nithin Rao Koluguri, Nune Tadevosyan, Piotr Zelasko, Travis Bartley, Nikolay Karpov, Jagadeesh Balam, Boris Ginsburg

TL;DR

The work presents Canary-1B-v2, a multilingual ASR and AST model built on a FastConformer encoder and Transformer decoder, and Parakeet-TDT-0.6B-v3, a 600M-parameter multilingual ASR model; an nGPT encoder is also explored experimentally. It introduces a two-stage pre-training and fine-tuning process with dynamic data balancing, a unified 25-language tokenizer, non-speech training data to mitigate hallucinations, and a timestamping pipeline based on the NeMo Forced Aligner. Across 25 languages, the models achieve competitive or superior ASR/AST performance at substantially higher throughput than larger baselines, and remain robust under noisy conditions. The paper also investigates long-form inference via chunking and positional encodings (RoPE vs ALiBi), and releases Parakeet-TDT-0.6B-v3 to extend multilingual ASR to resource-constrained settings.

Abstract

This report introduces Canary-1B-v2, a fast, robust multilingual model for Automatic Speech Recognition (ASR) and Speech-to-Text Translation (AST). Built with a FastConformer encoder and Transformer decoder, it supports 25 languages, primarily European. The model was trained on 1.7M hours of data in total, including Granary and NeMo ASR Set 3.0, with non-speech audio added to reduce hallucinations for ASR and AST. We describe its two-stage pre-training and fine-tuning process with dynamic data balancing, as well as experiments with an nGPT encoder. Results show nGPT scales well with massive data, while FastConformer excels after fine-tuning. For timestamps, Canary-1B-v2 uses the NeMo Forced Aligner (NFA) with an auxiliary CTC model, providing reliable segment-level timestamps for ASR and AST. Evaluations show Canary-1B-v2 outperforms Whisper-large-v3 on English ASR while being 10x faster, and delivers competitive multilingual ASR and AST performance against larger models like Seamless-M4T-v2-large and LLM-based systems. We also release Parakeet-TDT-0.6B-v3, a successor to v2, offering multilingual ASR across the same 25 languages with just 600M parameters.

Paper Structure

This paper contains 39 sections, 13 figures, and 17 tables.

Figures (13)

  • Figure 1: Overview of the FastConformer architecture. The input features are first subsampled to a resolution of 80 ms, followed by a linear projection and dropout for dimensionality reduction and regularization. The core of the model consists of a stack of repeated Conformer blocks, each containing four main components: Layer Normalization (LN), a Feed-Forward (FF) module, a Multi-Head Attention (MHA) mechanism, and a Convolutional Module (CC).
  • Figure 2: Overview of the nGPT architecture. The input features are first processed by a linear subsampling block. This is followed by a stack of repeated nGPT layers and a Transformer decoder. Each nGPT layer adopts a Transformer-style design with Rotary Position Embeddings (RoPE), multi-head attention, and gated feed-forward modules. Normalization is applied across the embedding dimension, and both activations and weight matrices are normalized, with weight normalization additionally enforced after each optimizer update.
  • Figure 3: Symmetric ALiBi bias matrix used in the nGPT encoder. Unlike the original causal ALiBi, which penalizes only future positions, this non-causal variant applies equal bias to tokens before and after the current position, ensuring balanced treatment of left and right context.
  • Figure 4: Overview of participating corpora in the training set with within-corpus task divisions.
  • Figure 5: Language (non-English) Training Hours Distribution by Task.
  • ...and 8 more figures
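
The symmetric ALiBi bias described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes the bias for head h at query position i and key position j is -m_h * |i - j|, so positions to the left and right are penalized equally. The helper names (`alibi_slopes`, `symmetric_alibi_bias`) are hypothetical, and the geometric slope schedule is borrowed from the original ALiBi recipe for a power-of-two head count; the paper's exact slopes and implementation are not given here.

```python
def alibi_slopes(num_heads):
    """Per-head slopes m_h = 2^(-8h / num_heads), h = 1..num_heads.

    Assumption: the geometric schedule from the original ALiBi recipe,
    which is defined this way when num_heads is a power of two.
    """
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def symmetric_alibi_bias(seq_len, num_heads):
    """Return bias[h][i][j] = -m_h * |i - j| as nested lists.

    Causal ALiBi only covers one side of the diagonal; using the absolute
    distance |i - j| gives the same penalty on both sides, which is the
    non-causal (encoder-style) variant sketched here.
    """
    slopes = alibi_slopes(num_heads)
    return [
        [[-m * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for m in slopes
    ]


# Illustrative sizes only; these are not the model's real dimensions.
bias = symmetric_alibi_bias(seq_len=4, num_heads=2)
for row in bias[0]:
    print(row)
```

In an attention layer, a matrix like `bias[h]` would be added to the pre-softmax attention scores of head h, so that more distant frames receive a larger penalty regardless of direction.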