Reading Between the Waves: Robust Topic Segmentation Using Inter-Sentence Audio Features
Steffen Freisinger, Philipp Seeberger, Tobias Bocklet, Korbinian Riedhammer
TL;DR
This work tackles topic segmentation in spoken content by leveraging acoustic cues centered on sentence boundaries within an end-to-end multimodal framework. The authors introduce MultiSeg, which augments sentence-level text embeddings with inter-sentence audio features produced by a Siamese boundary audio encoder and feeds the fused representation to a RoFormer tagger. Fine-tuning the audio encoder end to end and using 2-second boundary windows yield substantial gains over text-only baselines and prior multimodal approaches. The model is also more robust to ASR noise and outperforms a larger text-only baseline on three additional datasets in Portuguese, German, and English. The approach is aimed at navigation and information retrieval in podcasts and videos, and the authors provide code to facilitate replication and extension.
Abstract
Spoken content, such as online videos and podcasts, often spans multiple topics, which makes automatic topic segmentation essential for user navigation and downstream applications. However, current methods do not fully leverage acoustic features, leaving room for improvement. We propose a multi-modal approach that fine-tunes both a text encoder and a Siamese audio encoder, capturing acoustic cues around sentence boundaries. Experiments on a large-scale dataset of YouTube videos show substantial gains over text-only and multi-modal baselines. Our model also proves more resilient to ASR noise and outperforms a larger text-only baseline on three additional datasets in Portuguese, German, and English, underscoring the value of learned acoustic features for robust topic segmentation.
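To make the idea of boundary-centered audio features concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that each sentence boundary contributes one audio window ending at the boundary and one starting at it, that a single shared encoder (the Siamese part) processes both windows, and that the resulting audio features are concatenated with the sentence text embedding. The function names `boundary_windows`, `siamese_encode`, and `fuse`, the left/right split, the zero padding, and the concatenation-based fusion are all hypothetical choices made for illustration; only the 2-second window length comes from the summary above.

```python
import numpy as np


def boundary_windows(waveform, sample_rate, boundaries_sec, window_sec=2.0):
    """Cut fixed-length audio windows on both sides of each sentence boundary.

    Assumption: one window ends at the boundary and one starts at it.
    Windows that run past the start or end of the audio are zero-padded.
    Returns two arrays of shape (num_boundaries, window_samples).
    """
    n = int(round(window_sec * sample_rate))
    lefts, rights = [], []
    for t in boundaries_sec:
        b = int(round(t * sample_rate))
        b = min(max(b, 0), len(waveform))
        left = waveform[max(0, b - n):b]
        right = waveform[b:b + n]
        # Pad on the outer side so the boundary stays at the inner edge.
        left = np.pad(left, (n - len(left), 0))
        right = np.pad(right, (0, n - len(right)))
        lefts.append(left)
        rights.append(right)
    return np.stack(lefts), np.stack(rights)


def siamese_encode(left, right, encode):
    """Apply the same encoder (shared weights) to both sides of each boundary."""
    return np.concatenate([encode(left), encode(right)], axis=-1)


def fuse(text_embeddings, audio_features):
    """Assumed fusion: concatenate text and audio features per sentence."""
    return np.concatenate([text_embeddings, audio_features], axis=-1)
```

In a real system, `encode` would be a pretrained audio model fine-tuned jointly with the text encoder, and the fused sequence would be passed to a sequence tagger that predicts, for each sentence, whether a topic boundary follows it.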
