Digits micro-model for accurate and secure transactions
Chirag Chhablani, Nikhita Sharma, Jordan Hosier, Vijay K. Gurbani
TL;DR
This work introduces domain-specific, lightweight micro-models for accurate multi-digit number recognition in financial-domain ASR, prioritizing privacy and low resource usage. It curates a 14,000-utterance dataset combining Timers and Such, LibriSpeech, and Aurora, with a vocabulary tailored to five-digit numbers and pronunciation variants, and labels digit sequences via Whisper-derived timestamps. Two Kaldi-based micro-models (dense and light) are trained, achieving digit WER competitive with the state of the art while using substantially less memory and training time than generic models such as Whisper or Google STT, enabling on-premise deployment. The results demonstrate strong digit transcription performance with minimal latency and memory demands, highlighting the practicality and privacy advantages of domain-specific micro-models for sensitive financial applications.
Abstract
Automatic Speech Recognition (ASR) systems are used in the financial domain to enhance the caller experience by enabling natural language understanding and facilitating efficient and intuitive interactions. Increasing use of ASR systems requires that such systems exhibit very low error rates. The predominant ASR models used to collect numeric data are large, general-purpose models, either commercial (Google Speech-to-Text (STT) or Amazon Transcribe) or open source (OpenAI's Whisper). Such ASR models are trained on hundreds of thousands of hours of audio data and require considerable resources to run. Despite recent progress in large speech recognition models, we highlight the potential of smaller, specialized "micro" models. Such light models can be trained to perform well on number recognition tasks, competing with general models like Whisper or Google STT while using less than 80 minutes of training time and occupying at least an order of magnitude less memory. Also, unlike larger speech recognition models, micro-models are trained on carefully selected and curated datasets, which makes them highly accurate, agile, and easy to retrain, while using low compute resources. We present our work on creating micro-models for multi-digit number recognition that handle diverse speaking styles reflecting real-world pronunciation patterns. Our work contributes to domain-specific ASR models by improving both digit recognition accuracy and data privacy. As an added advantage, their low resource consumption allows them to be hosted on-premise, keeping private data local instead of uploading it to an external cloud. Our results indicate that our micro-model makes fewer errors than the best-of-breed commercial or open-source ASRs in recognizing digits (1.8% error rate for our best micro-model versus 5.8% for Whisper) and has a low memory footprint (0.66 GB VRAM for our model versus 11 GB VRAM for Whisper).
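To make the reported error rates concrete, the following is a minimal sketch of how a word error rate (WER) over spoken-digit transcripts is commonly computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name and the example utterances are hypothetical, and the exact scoring and text normalization behind the figures above may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: assumes whitespace tokenization and no text
    normalization, which may not match the scoring used in the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical five-digit utterance with one misrecognized word.
reference = "four two seven one nine"
hypothesis = "four two eleven one nine"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

Under this definition, a single substituted word in a five-digit sequence already costs one fifth of the utterance, which is why small absolute differences in WER matter for transactions that need every digit to be correct.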
