Beyond Transcripts: A Renewed Perspective on Audio Chaptering

Fabian Retkowski; Maike Züfle; Thai Binh Nguyen; Jan Niehues; Alexander Waibel

TL;DR

Experiments on YTSeg show that AudioSeg substantially outperforms text-based approaches and that pauses provide the largest acoustic gains. MLLMs remain limited by context length and weak instruction following, but they are promising on shorter audio.

Abstract

Audio chaptering, the task of automatically segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and text-based, leaving key questions unresolved about leveraging audio information, handling ASR errors, and transcript-free evaluation. We address these gaps through three contributions: (1) a systematic comparison of text-based models augmented with acoustic features, a novel audio-only architecture (AudioSeg) operating on learned audio representations, and multimodal LLMs (MLLMs); (2) an empirical analysis of factors affecting performance, including transcript quality, acoustic features, duration, and speaker composition; and (3) formalized evaluation protocols contrasting transcript-dependent text-space protocols with transcript-invariant time-space protocols. Our experiments on YTSeg reveal that AudioSeg substantially outperforms text-based approaches and that pauses provide the largest acoustic gains. MLLMs remain limited by context length and weak instruction following, but they are promising on shorter audio.

Paper Structure (77 sections, 2 equations, 8 figures, 18 tables)

Introduction
Evaluation Protocols
Text-Based Segmentation
Text-Space Protocols
Evaluation on Ref. Transcripts (R1)
Evaluation on ASR Transcripts (H1)
Alignment to Reference Text (H2/H3)
Time-Space Protocols
Discrete-Time Evaluation (T1)
Continuous-Time Evaluation (T2)
Approaches
Text-Based Baseline
Hand-Crafted Audio Features
Feature Fusion
Audio-Only Model
...and 62 more sections

Figures (8)

Figure 1: AudioSeg processes input audio of duration $D$ through three stages: Frame Encoding extracts frame-level features from 30s chunks using a frozen audio encoder; Segment Encoding groups frames into 6s windows and encodes each via a Local Segment Transformer (trainable) with [CLS] pooling to produce $K = \lceil D/\Delta t \rceil$ segment embeddings; Document Encoding processes the segment sequence through a RoFormer encoder (trainable) to predict a binary boundary sequence $(b_1, \ldots, b_K)$ indicating chapter boundaries.
Figure 2: Relation between duration and segmentation performance across models. Smoothed using LOESS.
Figure 3: Relation between dominant speaker proportion and segmentation performance across models, for videos shorter than 30 minutes. Smoothed using LOESS.
Figure A1: LLaMA 3.1 8B system and user prompts for transcript chaptering.
Figure A2: Qwen system prompts and transcription prompt.
...and 3 more figures
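
The Figure 1 caption gives the number of segment embeddings as $K = \lceil D/\Delta t \rceil$ and describes the output as a binary boundary sequence $(b_1, \ldots, b_K)$. The following is a minimal sketch of that bookkeeping only, not the authors' implementation. The function names are hypothetical, $\Delta t = 6$ s is taken from the 6s windows in the caption, and the convention that $b_k = 1$ marks the window in which a new chapter starts is an assumption made for illustration.

```python
import math


def num_segments(duration_s: float, window_s: float = 6.0) -> int:
    """Number of segment embeddings, K = ceil(D / delta_t)."""
    return math.ceil(duration_s / window_s)


def boundaries_to_timestamps(boundaries, window_s: float = 6.0):
    """Map a binary boundary sequence to chapter start times in seconds.

    Assumption (not stated in the source): b_k = 1 means a new chapter
    starts at the beginning of window k, counting windows from zero.
    """
    return [k * window_s for k, b in enumerate(boundaries) if b == 1]


if __name__ == "__main__":
    # A 100 s recording with 6 s windows needs ceil(100 / 6) = 17 windows.
    print(num_segments(100.0))
    # Boundaries flagged in windows 0 and 3 map to 0 s and 18 s.
    print(boundaries_to_timestamps([1, 0, 0, 1, 0]))
```

Under these assumptions, the temporal resolution of predicted chapter starts is bounded by the window length, so a 6 s window places each boundary on a 6 s grid.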
