TreeSeg: Hierarchical Topic Segmentation of Large Transcripts

Dimitrios C. Gklezakos, Timothy Misiak, Diamond Bishop

TL;DR

TreeSeg hierarchically segments long ASR-generated transcripts by combining block-context utterance embeddings with divisive, unsupervised clustering to produce a binary partition tree. The method embeds each position via overlapping utterance blocks using off-the-shelf embedding models (e.g., text-embedding-ada-002) and recursively identifies the optimal split point through a one-dimensional loss that compares cluster centers, subject to a minimum segment size constraint. It is evaluated on ICSI, AMI, and TinyRec, where it outperforms baselines such as BertSeg, HyperSeg, and naive strategies across multiple hierarchical levels under the $P_k$ and WinDiff metrics. The work contributes a fully unsupervised, parameter-efficient approach with controllable segmentation granularity, and it introduces TinyRec, a small, manually annotated corpus that complements the larger meeting datasets. The approach is practical for organizing long transcripts into chapters and for downstream tasks that require bounded context, such as summarization or knowledge extraction, all without labeled data.
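To make the "overlapping utterance blocks" idea concrete, here is a minimal sketch of how one text block per transcript position might be assembled before being passed to an embedding model. The helper name `build_blocks`, the `width` parameter, and the choice of a trailing window (each utterance joined with the utterances just before it) are assumptions for illustration; the summary above only states that the blocks overlap. The embedding call itself is omitted, since it depends on the model used.

```python
from typing import List


def build_blocks(utterances: List[str], width: int = 3) -> List[str]:
    """Return one text block per position (hypothetical helper).

    Block i joins utterance i with up to `width - 1` preceding utterances,
    so consecutive blocks overlap. The trailing-window layout is an
    assumption, not necessarily the layout used by TreeSeg.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    blocks = []
    for i in range(len(utterances)):
        start = max(0, i - width + 1)  # clip the window at the transcript start
        blocks.append(" ".join(utterances[start:i + 1]))
    return blocks
```

Each returned string would then be embedded once, giving a sequence of vectors aligned with the transcript timeline.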

Abstract

From organizing recorded videos and meetings into chapters, to breaking down large inputs in order to fit them into the context window of commoditized Large Language Models (LLMs), topic segmentation of large transcripts emerges as a task of increasing significance. Still, accurate segmentation presents many challenges, including (a) the noisy nature of the Automatic Speech Recognition (ASR) software typically used to obtain the transcripts, (b) the lack of diverse labeled data and (c) the difficulty in pin-pointing the ground-truth number of segments. In this work we present TreeSeg, an approach that combines off-the-shelf embedding models with divisive clustering, to generate hierarchical, structured segmentations of transcripts in the form of binary trees. Our approach is robust to noise and can handle large transcripts efficiently. We evaluate TreeSeg on the ICSI and AMI corpora, demonstrating that it outperforms all baselines. Finally, we introduce TinyRec, a small-scale corpus of manually annotated transcripts, obtained from self-recorded video sessions.

Paper Structure (15 sections, 2 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: From linear to hierarchical topic segmentation: (a) A partition tree of depth $1$, corresponding to linear topic segmentation, and (b) a deeper partition tree, corresponding to hierarchical topic segmentation. The root node always covers the full timeline. Note that in both cases, the children of a node form a partition of the node's segment.
  • Figure 2: Inaccurate hierarchical segmentation: An example of an accurate linear, but inaccurate hierarchical, approximation of the tree in Figure 1(b). Note that the leaves of the output partition match those of the ground-truth partition; however, the order in which the nodes are partitioned is not respected, and the hierarchical structure of the segments is not properly identified.
  • Figure 3: Dividing the timeline: (a) At each step, valid candidate splitting points are identified for all leaves. (b) & (c) The optimal splitting point across all leaves is used to divide the corresponding segment into two sub-segments. The process continues until a termination criterion is met.
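
The procedure in the Figure 3 caption can be sketched roughly as follows. This is an illustrative approximation, not the authors' implementation: the function names `best_split` and `segment`, the sum-of-squared-distances loss around each sub-segment mean, the ranking of leaves by loss reduction, and the `max_segments` termination criterion are all assumptions. The input is an array with one embedding row per transcript position.

```python
import heapq

import numpy as np


def best_split(emb, lo, hi, min_size):
    """Find the best split of emb[lo:hi] (hypothetical helper).

    Returns (gain, s), where s minimizes the summed squared distance of each
    side to its own mean, or None if no split respects `min_size`.
    """
    if hi - lo < 2 * min_size:
        return None
    seg = emb[lo:hi]
    total = float(((seg - seg.mean(axis=0)) ** 2).sum())
    best_loss, best_s = None, None
    # Both sides must keep at least `min_size` positions.
    for s in range(lo + min_size, hi - min_size + 1):
        left, right = emb[lo:s], emb[s:hi]
        loss = float(((left - left.mean(axis=0)) ** 2).sum()
                     + ((right - right.mean(axis=0)) ** 2).sum())
        if best_loss is None or loss < best_loss:
            best_loss, best_s = loss, s
    return total - best_loss, best_s


def segment(emb, max_segments, min_size=1):
    """Greedily split the timeline; returns split indices in split order."""
    emb = np.asarray(emb, dtype=float)
    heap = []  # max-heap on gain, stored as negated values

    def push(lo, hi):
        cand = best_split(emb, lo, hi, min_size)
        if cand is not None:
            gain, s = cand
            heapq.heappush(heap, (-gain, lo, s, hi))

    push(0, len(emb))
    splits = []
    # Assumed termination criterion: stop at `max_segments` leaves.
    while heap and len(splits) + 1 < max_segments:
        _, lo, s, hi = heapq.heappop(heap)
        splits.append(s)
        push(lo, s)   # the two children become new leaves
        push(s, hi)
    return splits
```

Because splits are returned in the order they were made, the first entries correspond to the top levels of the partition tree, and sorting the list gives the boundaries of the equivalent linear segmentation.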