The TCG CREST -- RKMVERI Submission for the NCIIPC Startup India AI Grand Challenge
Nikhil Raghav, Arnab Banerjee, Janojit Chakraborty, Avisek Gupta, Swami Punyeshwarananda, Md Sahidullah
TL;DR
The paper presents an integrated multilingual audio processing pipeline for Problem Statement 06 (PS-06) of the NCIIPC Startup India AI Grand Challenge, covering language-agnostic speaker identification (SID) and speaker diarization (SD), followed by transcription and translation. The pipeline combines a Silero-VAD front end, ECAPA-TDNN speaker embeddings, unsupervised multi-kernel spectral clustering for diarization, language identification with VoxLingua107, automatic speech recognition (ASR) with language-appropriate models, and neural machine translation (NMT) using IndicTrans2 and Opus-MT. Key contributions include a robust SD method based on MK-SGCSC, an SID workflow with enrollment-based cosine scoring and smoothing, and end-to-end evaluation on mock data using IER, DER, WER, and BLEU. The results show stable performance on multilingual and code-mixed inputs and point to areas for future improvement, including cross-lingual representations and real-time processing.
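To make the SID step concrete, the following is a minimal sketch of enrollment-based cosine scoring followed by label smoothing. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 0.5 decision threshold, the three-segment majority window, and the toy two-dimensional vectors (standing in for ECAPA-TDNN embeddings) are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_segments(segment_embeddings, enrolled, threshold=0.5):
    """Label each segment with the best-scoring enrolled speaker.

    Segments whose best score does not exceed `threshold` (an assumed
    value) are labelled "unknown".
    """
    labels = []
    for emb in segment_embeddings:
        best_name, best_score = "unknown", threshold
        for name, ref in enrolled.items():
            score = cosine(emb, ref)
            if score > best_score:
                best_name, best_score = name, score
        labels.append(best_name)
    return labels


def smooth_labels(labels, window=3):
    """Majority-vote smoothing over a sliding window of raw labels.

    A label is replaced only when another label holds a strict majority
    in its neighbourhood; otherwise the original label is kept.
    """
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        neighbourhood = labels[max(0, i - half): i + half + 1]
        top, count = Counter(neighbourhood).most_common(1)[0]
        smoothed.append(top if count > len(neighbourhood) / 2 else labels[i])
    return smoothed


# Toy enrollment vectors and per-segment embeddings (hypothetical data).
enrolled = {"spk_a": [1.0, 0.0], "spk_b": [0.0, 1.0]}
segments = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.95, 0.05], [0.9, 0.0]]

raw = identify_segments(segments, enrolled)
smoothed = smooth_labels(raw, window=3)
print(raw)
print(smoothed)
```

In this toy run the third segment scores highest against `spk_b`, and the smoothing pass relabels it as `spk_a` because both of its neighbours are `spk_a`. A real system would tune the threshold and window on held-out data.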
Abstract
In this report, we summarize the integrated multilingual audio processing pipeline developed by our team for the inaugural NCIIPC Startup India AI Grand Challenge, addressing Problem Statement 06: Language-Agnostic Speaker Identification and Diarisation, and subsequent Transcription and Translation System. Our primary focus was on advancing speaker diarization (SD), a critical component for multilingual and code-mixed scenarios, and our main intent was to study the real-world applicability of our in-house SD systems. To this end, we investigated a robust voice activity detection (VAD) technique and fine-tuned speaker embedding models for improved speaker identification in low-resource settings. We used our recently proposed multi-kernel consensus spectral clustering framework, which substantially improved diarization performance across all recordings in the training corpus provided by the organizers. Complementary modules for speaker and language identification, automatic speech recognition (ASR), and neural machine translation (NMT) were integrated into the pipeline. Post-processing refinements further improved system robustness.
