The TCG CREST -- RKMVERI Submission for the NCIIPC Startup India AI Grand Challenge

Nikhil Raghav, Arnab Banerjee, Janojit Chakraborty, Avisek Gupta, Swami Punyeshwarananda, Md Sahidullah

TL;DR

The paper presents an integrated multilingual audio processing pipeline for PS-06 of the NCIIPC Startup India AI Grand Challenge, which covers language-agnostic speaker identification and diarization followed by transcription and translation. The pipeline combines a Silero-VAD front end, ECAPA-TDNN speaker embeddings, unsupervised multi-kernel spectral clustering for speaker diarization (SD), language identification with VoxLingua107, ASR with language-appropriate models, and neural machine translation (NMT) using IndicTrans2 and Opus-MT. Key contributions include a robust SD method based on MK-SGCSC, a speaker identification (SID) workflow with enrollment-based cosine scoring and smoothing, and end-to-end evaluation on mock data with metrics such as IER, DER, WER, and BLEU. The results demonstrate stable performance across multilingual, code-mixed inputs and point to areas for future improvement, including cross-lingual representations and real-time processing.

Abstract

In this report, we summarize the integrated multilingual audio processing pipeline developed by our team for the inaugural NCIIPC Startup India AI Grand Challenge, addressing Problem Statement 06: Language-Agnostic Speaker Identification and Diarisation, and subsequent Transcription and Translation System. Our primary focus was on advancing speaker diarization, a critical component for multilingual and code-mixed scenarios. The main intent of this work was to study the real-world applicability of our in-house speaker diarization (SD) systems. To this end, we investigated a robust voice activity detection (VAD) technique and fine-tuned speaker embedding models for improved speaker identification in low-resource settings. We leveraged our own recently proposed multi-kernel consensus spectral clustering framework, which substantially improved diarization performance across all recordings in the training corpus provided by the organizers. Complementary modules for speaker and language identification, automatic speech recognition (ASR), and neural machine translation were integrated into the pipeline. Post-processing refinements further improved system robustness.

Paper Structure

This paper contains 19 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The five tasks in the problem statement PS-06.
  • Figure 2: Our approach for building an integrated audio processing pipeline. The first block, VAD, provides the input to SID, LID, and SD. Further, the output of the LID serves as an input to the ASR, and likewise for the NMT module.
  • Figure 3: Empirical genuine and impostor cosine-similarity score distributions for the enrollment speaker in ID16.ogg. The analytic threshold $\Delta = 0.3147$ lies in the low-density valley between the two distributions.
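
The enrollment-based cosine scoring mentioned in the TL;DR, together with the threshold quoted in the Figure 3 caption, can be illustrated with a minimal sketch. It assumes that speaker embeddings are already available as fixed-length vectors and that a segment is accepted as the enrolled speaker when its cosine similarity to the enrollment embedding is at or above the threshold. The function names and the toy 3-dimensional vectors are hypothetical, and the smoothing step of the SID workflow is not reproduced here, since its details are not given in this summary.

```python
import numpy as np

# Threshold quoted in the Figure 3 caption for the enrollment speaker in ID16.ogg.
DELTA = 0.3147


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def score_segments(enroll_emb, segment_embs, threshold=DELTA):
    """Score each segment embedding against the enrollment embedding.

    Returns a list of (score, accepted) pairs. The accept rule
    (score >= threshold) is an assumption made for illustration.
    """
    results = []
    for emb in segment_embs:
        score = cosine_similarity(enroll_emb, emb)
        results.append((score, score >= threshold))
    return results


if __name__ == "__main__":
    # Toy vectors standing in for real speaker embeddings.
    enrollment = [1.0, 0.0, 0.0]
    segments = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
    for score, accepted in score_segments(enrollment, segments):
        print(f"score={score:.4f} accepted={accepted}")
```

In practice the embeddings would come from a speaker encoder such as the ECAPA-TDNN model named above, and the threshold would be chosen per enrollment speaker from the genuine and impostor score distributions rather than fixed globally.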