TokenVerse: Towards Unifying Speech and NLP Tasks via Transducer-based ASR

Shashi Kumar, Srikanth Madikeri, Juan Zuluaga-Gomez, Iuliia Thorbecke, Esaú Villatoro-Tello, Sergio Burdisso, Petr Motlicek, Karthik Pandia, Aravind Ganapathiraju

TL;DR

TokenVerse addresses the fragmentation of conversational speech pipelines by unifying automatic speech recognition (ASR) with speaker change detection (SCD), endpointing, and named entity recognition (NER) in a single Transducer model through a token augmentation protocol. Task tokens are embedded in the reference text during training, and the model uses an XLSR-53 encoder (XLSR-Transducer), so it predicts transcripts and task events end to end with time-aligned tokens. Results on DefinedAI and CallHome show up to 7.7% relative improvement in WER and competitive performance on the downstream tasks, and ablations indicate that multitask training often benefits all tasks. The work offers a practical, extensible framework for joint speech and NLP processing, with publicly available code.
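To make the "relative improvement in WER" figure concrete, the short sketch below computes a relative WER reduction from a baseline and a system WER. The function name and the example numbers are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both inputs are WERs in percent. A positive result means the system
    has a lower (better) WER than the baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Hypothetical numbers, NOT taken from the paper: an absolute drop of
# 2 WER points from a 20% baseline is a 10% relative reduction.
example = relative_wer_reduction(baseline_wer=20.0, system_wer=18.0)
print(f"{example:.1f}% relative WER reduction")
```

A relative figure therefore depends on the baseline: the same absolute drop counts for more when the baseline WER is already low.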

Abstract

In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achieved by integrating task-specific tokens into the reference text during ASR model training, streamlining the inference and eliminating the need for separate NLP models. In addition to ASR, we conduct experiments on 3 different tasks: speaker change detection, endpointing, and NER. Our experiments on a public and a private dataset show that the proposed method improves ASR by up to 7.7% in relative WER while outperforming the cascaded pipeline approach in individual task performance. Our code is publicly available: https://github.com/idiap/tokenverse-unifying-speech-nlp
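The idea of integrating task-specific tokens into the reference text can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Segment` structure, the `augment` function, the token spellings `[SCD]`, `[ENDP]`, and `[/NE]`, and the ordering of tokens at a turn boundary are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Token spellings are assumptions for illustration.
SCD, ENDP, NE_OPEN, NE_CLOSE = "[SCD]", "[ENDP]", "[NE]", "[/NE]"


@dataclass
class Segment:
    """One annotated stretch of speech (hypothetical annotation format)."""
    speaker: str
    words: List[str]
    entity_spans: List[Tuple[int, int]]  # half-open word-index ranges [start, end)
    ends_turn: bool                      # True if an endpoint follows this segment


def augment(segments: List[Segment]) -> str:
    """Build a single training reference with task tokens inserted."""
    out: List[str] = []
    prev_speaker = None
    for seg in segments:
        # Mark a speaker change between consecutive segments.
        if prev_speaker is not None and seg.speaker != prev_speaker:
            out.append(SCD)
        starts = {s for s, _ in seg.entity_spans}
        ends = {e for _, e in seg.entity_spans}
        for i, word in enumerate(seg.words):
            if i in starts:
                out.append(NE_OPEN)
            out.append(word)
            if i + 1 in ends:
                out.append(NE_CLOSE)
        # Mark the end of a turn (assumed to precede any speaker-change token).
        if seg.ends_turn:
            out.append(ENDP)
        prev_speaker = seg.speaker
    return " ".join(out)


segments = [
    Segment("A", ["hi", "i", "am", "john", "smith"], [(3, 5)], True),
    Segment("B", ["hello", "john"], [(1, 2)], False),
]
print(augment(segments))
```

Because the task tokens are ordinary vocabulary items in the augmented reference, an ASR model trained on such text can emit them alongside words, which is what removes the need for separate downstream NLP models.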

Paper Structure

This paper contains 14 sections, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: a) Proposed unified token augmentation protocol for SCD, ENDP, and NER. b) TokenVerse unifies multiple speech and NLP tasks (e.g., T1+T2+T3) in a single model within the neural Transducer framework.
  • Figure 2: Absolute changes in text-based evaluation w.r.t. all-tasks TokenVerse in @F1. We either remove a task, e.g., remove-[NE], or transfer to the removed task, e.g., transfer-to $\rightarrow$ [NE]. Note that all-tasks TokenVerse performs better in all scenarios.
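As a rough illustration of how a single model's output (Figure 1b) could be consumed, the sketch below splits a token-augmented hypothesis into a plain transcript, task events, and entity strings. The function name `split_hypothesis` and the token spellings other than `[NE]` are assumptions for this example, and the word-index positions used here are a simplification that stands in for the time alignment a Transducer would provide.

```python
from typing import List, Tuple

# Assumed token spellings; only "[NE]" appears in the captions above.
EVENT_TOKENS = ("[SCD]", "[ENDP]")
NE_OPEN, NE_CLOSE = "[NE]", "[/NE]"


def split_hypothesis(hyp: str) -> Tuple[str, List[Tuple[str, int]], List[str]]:
    """Separate words, event tokens, and named entities in a hypothesis.

    Returns (transcript, events, entities), where each event is a pair
    (token, number_of_words_emitted_before_it).
    """
    words: List[str] = []
    events: List[Tuple[str, int]] = []
    entities: List[str] = []
    open_at = None  # word index where the current entity started
    for tok in hyp.split():
        if tok == NE_OPEN:
            open_at = len(words)
        elif tok == NE_CLOSE:
            # Ignore a closing token that has no matching opening token.
            if open_at is not None:
                entities.append(" ".join(words[open_at:]))
                open_at = None
        elif tok in EVENT_TOKENS:
            events.append((tok, len(words)))
        else:
            words.append(tok)
    return " ".join(words), events, entities


transcript, events, entities = split_hypothesis(
    "hi i am [NE] john smith [/NE] [ENDP] [SCD] hello"
)
print(transcript)
print(events)
print(entities)
```

Stripping the task tokens yields text that can be scored for WER in the usual way, while the extracted events and entities can be compared against references for a text-based evaluation of each task.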