Benchmarking Automatic Speech Recognition coupled LLM Modules for Medical Diagnostics
Kabir Kumar
TL;DR
This work benchmarks a two-stage pipeline for medical diagnostics by coupling Automatic Speech Recognition (ASR) with Large Language Models (LLMs), augmented by audio preprocessing to combat noise and clipping. It compares Whisper and wav2vec 2.0 for ASR and uses Qwen2 (LoRA-tuned) versus Llama3 for LLM-driven classification on a Kaggle medical speech dataset, emphasizing robustness and latency. Whisper achieves the lowest Word Error Rate (WER) with minimal finetuning, while Qwen2 provides fast, context-aware label classification, illustrating a practical trade-off between transcription accuracy and downstream inference. The study demonstrates a viable path toward robust medical speech understanding and potential conversational assistants in clinical workflows, with clear avenues for scaling and integration.
Abstract
Natural Language Processing (NLP) and voice recognition agents are rapidly changing healthcare by enabling efficient, accessible, and professional patient support while automating routine work. This report documents a self-directed project in which models finetuned on medical call recordings are analysed through a two-stage system: Automatic Speech Recognition (ASR) for speech transcription and a Large Language Model (LLM) for context-aware, professional responses. The ASR model, finetuned on phone call recordings, provides generalised transcription of diverse patient speech over a call, while the LLM maps the transcribed text to a medical diagnosis. A novel audio preprocessing strategy is deployed to make the pipeline invariant to the conditions of incoming recordings and calls: the audio is augmented with noise and clipping so that the pipeline is robust to the type of microphone and the ambient conditions the patient might have while calling or recording.
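The noise and clipping augmentation mentioned above can be illustrated with a minimal sketch. This is not the project's actual preprocessing code: the function names (`add_noise`, `hard_clip`, `augment`), the use of white Gaussian noise, and the default values for `snr_db` and `clip_fraction` are all assumptions chosen for illustration.

```python
import numpy as np


def add_noise(wave, snr_db, rng):
    """Add white Gaussian noise at a target signal-to-noise ratio in dB.

    Assumption: white noise stands in for real ambient noise.
    """
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise


def hard_clip(wave, threshold):
    """Simulate microphone saturation by limiting the amplitude."""
    return np.clip(wave, -threshold, threshold)


def augment(wave, snr_db=20.0, clip_fraction=0.8, seed=0):
    """Apply noise, then clip at a fraction of the noisy peak.

    The default values are hypothetical, not taken from the report.
    """
    rng = np.random.default_rng(seed)
    noisy = add_noise(wave, snr_db, rng)
    peak = np.max(np.abs(noisy))
    return hard_clip(noisy, clip_fraction * peak)


if __name__ == "__main__":
    # One second of a 440 Hz tone at 16 kHz as a stand-in for speech.
    sample_rate = 16000
    t = np.arange(sample_rate) / sample_rate
    clean = 0.5 * np.sin(2 * np.pi * 440 * t)
    augmented = augment(clean)
    print(augmented.shape, float(np.max(np.abs(augmented))))
```

In a training setup, `snr_db` and `clip_fraction` would typically be sampled at random per example so that the model sees a range of microphone and ambient conditions.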
