POWSM: A Phonetic Open Whisper-Style Speech Foundation Model
Chin-Jou Li, Kalvin Chang, Shikhar Bharadwaj, Eunjung Yeo, Kwanghee Choi, Jian Zhu, David Mortensen, Shinji Watanabe
TL;DR
This paper introduces POWSM, the first unified phonetic foundation model that performs phone recognition (PR), automatic speech recognition (ASR), audio-guided grapheme-to-phoneme conversion (G2P), and audio-guided phoneme-to-grapheme conversion (P2G), enabling conversion among audio, graphemes, and phones with cross-lingual transfer. Built on an attention-based encoder-decoder architecture and trained from scratch on open multilingual data, POWSM jointly optimizes the four tasks using a multitask data format and phonetic supervision, achieving strong PR and competitive ASR across languages, including unseen ones. The work analyzes encoder targets, CTC weighting, and token-driven prompting, revealing interpretable behaviors such as phonetic preservation through speech-guided G2P and language-aware phonotactics through language tokens, and it releases all data processing pipelines and checkpoints for open science. Empirically, POWSM matches or surpasses task-specific baselines on PR, approaches the performance of web-scale ASR models on low-resource languages, and demonstrates that a single model can support rich phonetic tasks across 70+ languages, with practical implications for low-resource and endangered-language processing.
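To make the multitask data format and CTC weighting mentioned above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the token spellings (`<eng>`, `<asr>`, and so on), the position of the text prompt, and the helper names `Example`, `build_sequence`, and `hybrid_loss` are all hypothetical. The loss follows the common hybrid CTC/attention interpolation, which may differ from the exact objective used in the paper.

```python
from dataclasses import dataclass
from typing import List

# The four tasks named in the summary; the token spellings are assumptions.
TASKS = ("pr", "asr", "g2p", "p2g")


@dataclass
class Example:
    lang: str            # language code, e.g. "eng" (hypothetical convention)
    graphemes: str       # orthographic transcript
    phones: List[str]    # phone sequence, one symbol per item


def build_sequence(ex: Example, task: str) -> List[str]:
    """Build a Whisper-style decoder sequence: [text prompt] <lang> <task> target."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    phone_tokens = [f"/{p}/" for p in ex.phones]
    grapheme_tokens = ex.graphemes.split()
    if task == "pr":        # audio -> phones
        prompt, target = [], phone_tokens
    elif task == "asr":     # audio -> graphemes
        prompt, target = [], grapheme_tokens
    elif task == "g2p":     # audio + graphemes -> phones
        prompt, target = grapheme_tokens, phone_tokens
    else:                   # p2g: audio + phones -> graphemes
        prompt, target = phone_tokens, grapheme_tokens
    return prompt + [f"<{ex.lang}>", f"<{task}>"] + target


def hybrid_loss(ctc_loss: float, att_loss: float, ctc_weight: float) -> float:
    """Interpolate encoder CTC loss and decoder attention loss (assumed form)."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


if __name__ == "__main__":
    ex = Example(lang="eng", graphemes="hi there", phones=["h", "aɪ", "ð", "ɛ", "ɹ"])
    for task in TASKS:
        print(task, build_sequence(ex, task))
```

In this sketch the same utterance yields four training sequences that differ only in the prompt, the task token, and the target side, which is one way a single encoder-decoder could be steered between tasks by tokens alone.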
Abstract
Recent advances in spoken language processing have led to substantial progress in phonetic tasks such as automatic speech recognition (ASR), phone recognition (PR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P2G). Despite their conceptual similarity, these tasks have largely been studied in isolation, each relying on task-specific architectures and datasets. In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. POWSM enables seamless conversion between audio, text (graphemes), and phones, opening up new possibilities for universal and low-resource speech processing. Our model outperforms or matches specialized PR models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR. Our training data, code and models are released to foster open science.
