Efficient Ensemble for Multimodal Punctuation Restoration using Time-Delay Neural Network

Xing Yi Liu, Homayoon Beigi

TL;DR

This work tackles punctuation restoration in ASR by introducing EfficientPunct, a multimodal framework that fuses text-based predictions from a fine-tuned BERT with audio-based cues from a Kaldi TED-LIUM 3 pipeline using forced alignment. The core novelty lies in replacing attention-based fusion with alignment-driven concatenation and a time-delay neural network, enabling an ensemble that outperforms the previous state of the art while using substantially fewer parameters ($<\frac{1}{10}$ of the prior inference-network size). The authors demonstrate that a carefully balanced ensemble, especially with a modest emphasis on language cues ($\alpha \approx 0.4$), yields superior F1 across commas, full stops, and question marks on MuST-C v1 data. The approach offers practical benefits for real-time punctuation restoration in ASR systems, combining efficiency with strong predictive performance, and points to future directions in joint training and multilingual extensions.
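The alignment-driven concatenation mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `fuse_by_alignment`, the `(start, end)` span format, the embedding sizes, and the zero vector used for frames outside any token are all assumptions made for illustration.

```python
import numpy as np

def fuse_by_alignment(text_emb, audio_emb, spans):
    """Concatenate text and audio embeddings frame by frame.

    text_emb:  (num_tokens, d_text) array, one row per token.
    audio_emb: (num_frames, d_audio) array, one row per acoustic frame.
    spans:     list of (start, end) frame indices per token, end exclusive,
               in a hypothetical format a forced aligner might report.
    Returns a (d_text + d_audio, num_frames) matrix, one column per frame.
    """
    num_frames = audio_emb.shape[0]
    d_text = text_emb.shape[1]
    # Assumption: frames not covered by any token (e.g. silence)
    # keep an all-zero text vector.
    text_per_frame = np.zeros((num_frames, d_text), dtype=audio_emb.dtype)
    for token_idx, (start, end) in enumerate(spans):
        # Repeat the token's embedding over every frame it spans.
        text_per_frame[start:end] = text_emb[token_idx]
    # Stack text and audio features per frame, then put frames in columns.
    return np.concatenate([text_per_frame, audio_emb], axis=1).T

# Toy example: 2 tokens, 5 frames, frame 2 is silence.
text_emb = np.array([[1.0, 2.0], [3.0, 4.0]])
audio_emb = np.arange(15.0).reshape(5, 3)
fused = fuse_by_alignment(text_emb, audio_emb, [(0, 2), (3, 5)])
print(fused.shape)
```

Because every frame already carries the embedding of the token it belongs to, a convolution over the time axis can see both modalities at once, which is the intuition behind dropping attention-based fusion.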

Abstract

Punctuation restoration plays an essential role in the post-processing procedure of automatic speech recognition, but model efficiency is a key requirement for this task. To that end, we present EfficientPunct, an ensemble method with a multimodal time-delay neural network that outperforms the current best model by 1.0 F1 points, using less than a tenth of its inference network parameters. We streamline a speech recognizer to efficiently output hidden layer acoustic embeddings for punctuation restoration, as well as BERT to extract meaningful text embeddings. By using forced alignment and temporal convolutions, we eliminate the need for attention-based fusion, greatly increasing computational efficiency and raising performance. EfficientPunct sets a new state of the art with an ensemble that weights BERT's purely language-based predictions slightly more than the multimodal network's predictions. Our code is available at https://github.com/lxy-peter/EfficientPunct.
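The ensemble described in the abstract can be sketched as a weighted average of the two branches' class probabilities. This is a hedged illustration rather than the released code: the function name `ensemble`, the four-class layout (including a "no punctuation" class), and the convention that `alpha` multiplies the multimodal branch while `1 - alpha` multiplies BERT are assumptions, chosen so that `alpha = 0.4` gives BERT the slightly larger weight.

```python
import numpy as np

# Assumed class order; the source names commas, full stops and question marks.
CLASSES = ["none", "comma", "full_stop", "question_mark"]

def ensemble(p_bert, p_multimodal, alpha=0.4):
    """Weighted average of two probability vectors.

    alpha weights the multimodal branch and (1 - alpha) weights BERT
    (an assumed parameterization, see the paragraph above).
    """
    p_bert = np.asarray(p_bert, dtype=float)
    p_multimodal = np.asarray(p_multimodal, dtype=float)
    return alpha * p_multimodal + (1.0 - alpha) * p_bert

# Made-up probabilities for a single word position.
p_bert = [0.1, 0.6, 0.2, 0.1]
p_multimodal = [0.2, 0.3, 0.4, 0.1]
combined = ensemble(p_bert, p_multimodal)
prediction = CLASSES[int(np.argmax(combined))]
print(combined, prediction)
```

Since both inputs are probability distributions and the weights sum to one, the combined vector is again a distribution, so the final label is simply its argmax.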

Paper Structure (17 sections, 3 equations, 2 figures, 5 tables)


Figures (2)

  • Figure 1: The EfficientPunct framework. The top branch predicts using text only, while the bottom branch predicts using text and audio.
  • Figure 2: An example of preparing a data sample. We take 301 frames/columns, centered at the punctuation mark, from the matrix of concatenated text and audio embeddings.
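The sample preparation in Figure 2 can be sketched as slicing a fixed-width window of columns out of the fused embedding matrix. The snippet below is a minimal illustration under stated assumptions: the function name `extract_window` is hypothetical, and zero-padding at utterance boundaries is an assumed choice that the caption does not specify.

```python
import numpy as np

def extract_window(fused, center, width=301):
    """Take `width` columns of `fused` centered at column `center`.

    fused:  (dim, num_frames) matrix of concatenated text and audio embeddings.
    center: frame index of the candidate punctuation position,
            with 0 <= center < num_frames.
    Columns that would fall outside the matrix are zero-padded (assumption).
    """
    half = width // 2
    dim, num_frames = fused.shape
    window = np.zeros((dim, width), dtype=fused.dtype)
    # Clip the requested range to the frames that actually exist.
    src_start = max(center - half, 0)
    src_end = min(center + half + 1, num_frames)
    # Offset into the window so that `center` always lands in the middle column.
    dst_start = src_start - (center - half)
    window[:, dst_start:dst_start + (src_end - src_start)] = fused[:, src_start:src_end]
    return window

# Toy matrix with 2 feature rows and 1000 frames.
fused = np.arange(2000.0).reshape(2, 1000)
middle = extract_window(fused, 500)
near_start = extract_window(fused, 10)
print(middle.shape, near_start.shape)
```

With the default width of 301, the candidate position sits at column 150 with 150 frames of context on each side.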