When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation

Michael H. Coen

TL;DR

Dialogue topic segmentation lacks a universally correct boundary set and is highly sensitive to annotation granularity. The authors propose a granularity-aware evaluation framework that reports window-tolerant F1 (W-F1), boundary density (BOR), and segment coherence (purity and coverage), and they separate boundary scoring from boundary selection to explicitly control density. Through synthetic pretraining, supervised fine-tuning, and calibration across eight diverse datasets, they show that many reported gains stem from density alignment rather than genuine boundary detection improvements, and that coherence diagnostics are essential for interpreting results. The work provides practical guidance for downstream applications in conversational memory, retrieval, and summarization, and argues for reporting density and coherence alongside traditional boundary metrics. Overall, it reframes evaluation as a design choice about granularity that should be tuned to downstream use rather than optimized to match a fixed annotation scheme.

Abstract

Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of prior work, evaluation practice in dialogue topic segmentation remains dominated by strict boundary matching and F1-based metrics, even as modern LLM-based conversational systems increasingly rely on segmentation to manage conversation history beyond the model's fixed context window, where unstructured context accumulation degrades efficiency and coherence. This paper introduces an evaluation objective for dialogue topic segmentation that treats boundary density and segment coherence as primary criteria, alongside window-tolerant F1 (W-F1). Through extensive cross-dataset empirical evaluation, we show that reported performance differences across dialogue segmentation benchmarks are driven not by model quality, but by annotation granularity mismatches and sparse boundary labels. This indicates that many reported improvements arise from evaluation artifacts rather than improved boundary detection. We evaluate multiple structurally distinct dialogue segmentation strategies across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions. Across these settings, we observe high segment coherence combined with extreme oversegmentation relative to sparse labels, producing misleadingly low exact-match F1 scores. We show that topic segmentation is best understood as selecting an appropriate granularity rather than predicting a single correct boundary set. We operationalize this view by explicitly separating boundary scoring from boundary selection.
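To make the separation of boundary scoring from boundary selection concrete, the following is a minimal sketch, not the paper's method. It assumes a scorer has already produced one score per candidate gap between messages, and it uses a simple offline top-k rule to hit a target boundary density. The function name `select_boundaries`, the parameter `target_rate`, and the example scores are all hypothetical.

```python
from typing import List


def select_boundaries(scores: List[float], target_rate: float) -> List[int]:
    """Select boundary positions from fixed scores to match a target density.

    Hypothetical offline top-k rule: the scores are never modified,
    only the number of boundaries kept changes with `target_rate`.
    """
    if not 0.0 <= target_rate <= 1.0:
        raise ValueError("target_rate must lie in [0, 1]")
    k = round(target_rate * len(scores))
    # Rank candidate gaps by score, highest first, and keep the top k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Illustrative scores for 10 candidate gaps (made-up values).
scores = [0.1, 0.8, 0.3, 0.9, 0.2, 0.6, 0.4, 0.7, 0.05, 0.5]

coarse = select_boundaries(scores, target_rate=0.2)  # sparse segmentation
fine = select_boundaries(scores, target_rate=0.5)    # dense segmentation
print(coarse, fine)
```

The same scores yield a coarse or a fine segmentation depending only on the selection step, which is the sense in which granularity becomes a design choice rather than a property of the scorer.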

Paper Structure

This paper contains 41 sections, 2 figures, and 9 tables.

Figures (2)

  • Figure 1: Granularity mismatch in dialogue topic segmentation. Gold annotations mark a single coarse topic boundary (browsing $\rightarrow$ booking), while the model identifies multiple semantically meaningful subtopic transitions. The fine-grained boundaries correspond to coherent conversational phases and are not boundary-positioning errors. Under exact-match evaluation, three valid predicted subtopic boundaries are counted as false positives, yielding F1=0.40 despite correct identification of the coarse transition. The error arises from a mismatch in boundary density and segmentation granularity rather than from incorrect boundary detection.
  • Figure 2: Adaptive boundary selection enforces invariant boundary distributions across scoring granularities (DialSeg711). The top row shows the raw candidate boundary rate produced by the same scorer under fine (left) and coarse (right) candidate thresholds, resulting in substantially different candidate frequencies (73.5% vs. 57.7%). The dashed line indicates the target selection rate specified to the controller. The bottom row shows the resulting distribution of selected boundaries over message index after adaptive thresholding. Despite large differences in candidate generation, adaptive selection converges to nearly identical boundary distributions and total selection counts (16.7% selection rate, $N{=}542$ in both cases). This invariance is not the result of score rescaling or post-hoc normalization: the underlying scoring function is unchanged, and the effect emerges from online evidence accumulation and adaptive thresholding, reflecting successful decoupling of boundary selection from boundary scoring.
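
The F1 value in Figure 1 follows directly from the counts in the caption: one gold boundary, one correct prediction, and three extra predictions give precision $1/4$, recall $1$, and $F1 = 2 \cdot 0.25 \cdot 1 / 1.25 = 0.40$. The sketch below reproduces that arithmetic. The message indices are invented for illustration, and the greedy window matching is an assumed stand-in for a window-tolerant score, not necessarily the paper's definition of W-F1.

```python
from typing import Set, Tuple


def boundary_prf(gold: Set[int], pred: Set[int], window: int = 0) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted boundaries.

    A prediction counts as correct if it lies within `window` positions
    of a gold boundary that has not been matched yet (greedy matching,
    an assumption made for this sketch). window=0 is exact match.
    """
    unmatched = set(gold)
    tp = 0
    for p in sorted(pred):
        candidates = [g for g in unmatched if abs(g - p) <= window]
        if candidates:
            # Consume the nearest gold boundary so it cannot match twice.
            unmatched.remove(min(candidates, key=lambda g: abs(g - p)))
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical positions: one coarse gold boundary, four predictions,
# one of which coincides with the gold boundary.
gold = {6}
pred = {2, 4, 6, 9}

exact = boundary_prf(gold, pred, window=0)
tolerant = boundary_prf(gold, pred, window=1)
print(exact, tolerant)
```

With these positions a tolerance window of one message leaves the score unchanged, because the extra boundaries are not near-misses of the gold boundary. That matches the caption's reading of the low score as a density mismatch rather than a positioning error.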