From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions
Fabian Retkowski, Alexander Waibel
TL;DR
This work addresses the lack of robust benchmarks for text segmentation of spoken, unstructured content by introducing YTSeg, a large benchmark of YouTube video transcriptions with chapters, together with the concept of smart chaptering, which combines segmentation with meaningful title generation. It proposes MiniSeg, a lightweight yet competitive hierarchical segmentation model, and extends evaluation to online segmentation and real-time title generation. The paper shows that Wiki-727K pre-training and task adaptation can transfer to YTSeg and even improve performance on related tasks such as meeting segmentation, while the online setting reveals the importance of controlled future context. The findings have practical implications for structuring video transcripts in real-time applications, enabling clearer navigation and comprehension for users in educational, corporate, and content-creation contexts. Limitations include language scope, single-modality evaluation, and exposure bias in title generation, pointing to multimodal, multilingual smart chaptering as directions for future work.
Abstract
Text segmentation is a fundamental task in natural language processing, where documents are split into contiguous sections. However, prior research in this area has been constrained by limited datasets, which are either small in scale, synthesized, or only contain well-structured documents. In this paper, we address these limitations by introducing a novel benchmark, YTSeg, focusing on spoken content that is inherently more unstructured and both topically and structurally diverse. As part of this work, we introduce an efficient hierarchical segmentation model, MiniSeg, that outperforms state-of-the-art baselines. Lastly, we expand the notion of text segmentation to a more practical "smart chaptering" task that involves the segmentation of unstructured content, the generation of meaningful segment titles, and a potential real-time application of the models.
