VoiceGRPO: Modern MoE Transformers with Group Relative Policy Optimization GRPO for AI Voice Health Care Applications on Voice Pathology Detection
Enkhtogtokh Togootogtokh, Christian Klasen
TL;DR
The paper tackles automatic voice pathology detection under data scarcity by proposing VoiceGRPO, a Mixture-of-Experts Transformer trained with Group Relative Policy Optimization. The model uses a VoiceMoETransformer with a gating network to fuse multiple expert pathways and RL-inspired training (PPO and GRPO) to stabilize updates in complex MoE architectures. A synthetic dataset mimicking clinical biomarkers is used for evaluation, and results show VoiceGRPO achieving high accuracy, F1, and ROC AUC, outperforming MoE PPO baselines. Ablation studies highlight the importance of the gating network and latent encoder, with code released to support reproducibility and future work on real-world clinical data and enhanced gating strategies.
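As a rough illustration of how a gating network can fuse several expert pathways, the sketch below computes a softmax gate over experts and takes the gate-weighted sum of their outputs. It is a minimal NumPy example under stated assumptions, not the VoiceMoETransformer itself: the linear experts, the dense (non-sparse) gate, and all names (`gated_fusion`, `expert_weights`, `gate_weights`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(x, expert_weights, gate_weights):
    """Fuse linear experts with a softmax gate (illustrative assumption).

    x:              (batch, d_in) input features
    expert_weights: (n_experts, d_in, d_out), one linear map per expert
    gate_weights:   (d_in, n_experts), linear gating network
    """
    gate = softmax(x @ gate_weights)                          # (batch, n_experts)
    expert_out = np.einsum("bi,eio->beo", x, expert_weights)  # (batch, n_experts, d_out)
    fused = np.einsum("be,beo->bo", gate, expert_out)         # gate-weighted sum over experts
    return fused, gate

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
expert_weights = rng.normal(size=(3, 8, 16))
gate_weights = rng.normal(size=(8, 3))
fused, gate = gated_fusion(x, expert_weights, gate_weights)
```

Each row of `gate` sums to one, so the fused output is a convex combination of the expert outputs for that input.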
Abstract
This research introduces a novel AI technique, Mixture-of-Experts (MoE) Transformers trained with Group Relative Policy Optimization (GRPO), for voice health care applications, specifically voice pathology detection. Alongside the architectural innovations, we adopt advanced training paradigms inspired by reinforcement learning, namely Proximal Policy Optimization (PPO) and GRPO, to enhance model stability and performance. Experiments conducted on a synthetically generated voice pathology dataset demonstrate that our proposed models significantly improve diagnostic accuracy, F1 score, and ROC-AUC compared to conventional approaches. These findings underscore the potential of integrating transformer architectures with novel training strategies to advance automated voice pathology detection and ultimately contribute to more effective healthcare delivery. The code we used to train and evaluate our models is available at https://github.com/enkhtogtokh/voicegrpo
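To make the "group relative" idea concrete, the sketch below shows one common formulation: rewards are standardized within each group to form advantages, which then enter a PPO-style clipped surrogate. This is a hedged illustration only. How the paper forms groups and defines rewards is not stated here, and the function names, the `eps` constant, and the `clip_eps=0.2` default are assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within each group (last axis).

    rewards: (n_groups, group_size) array of scalar rewards.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=-1, keepdims=True)
    std = rewards.std(axis=-1, keepdims=True)
    return (rewards - mean) / (std + eps)

def clipped_surrogate(ratio, advantages, clip_eps=0.2):
    """PPO-style clipped objective (to be maximized).

    ratio: new-policy / old-policy probability ratios, same shape as advantages.
    """
    ratio = np.asarray(ratio, dtype=float)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()

# Hypothetical example: one group of four samples, reward 1 for a correct prediction.
adv = group_relative_advantages([[1.0, 0.0, 0.0, 1.0]])
objective = clipped_surrogate(np.ones_like(adv), adv)
```

Because advantages are computed relative to the group's own mean and spread, this formulation needs no separate learned value function as a baseline.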
