Consistent Kernel Change-Point Detection under m-Dependence for Text Segmentation

Jairo Diaz-Rodriguez, Mumin Jia

TL;DR

The paper extends kernel change-point detection (KCPD) to sequences with $m$-dependent structure, establishing consistency in the number of detected change points and weak consistency in their locations. It proves these results under mild assumptions, validates the asymptotics with an LLM-based simulation, and then presents a comprehensive empirical study of KCPD for text segmentation using modern sentence embeddings. Across diverse datasets, KCPD with embeddings and kernels (e.g., cosine) performs competitively with or better than unsupervised baselines and approaches supervised methods; a real-world case study on Taylor Swift's tweets illustrates practical utility. By linking nonparametric RKHS-based change-point theory with state-of-the-art NLP representations, the work ties statistical guarantees to practical text-segmentation tasks.

Abstract

Kernel change-point detection (KCPD) has become a widely used tool for identifying structural changes in complex data. While existing theory establishes consistency under independence assumptions, real-world sequential data such as text exhibits strong dependencies. We establish new guarantees for KCPD under $m$-dependent data: specifically, we prove consistency in the number of detected change points and weak consistency in their locations under mild additional assumptions. We perform an LLM-based simulation that generates synthetic $m$-dependent text to validate the asymptotics. To complement these results, we present the first comprehensive empirical study of KCPD for text segmentation with modern embeddings. Across diverse text datasets, KCPD with text embeddings outperforms baselines in standard text segmentation metrics. We demonstrate through a case study on Taylor Swift's tweets that KCPD not only provides strong theoretical and simulated reliability but also practical effectiveness for text segmentation tasks.
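To make the method described above more concrete, here is a minimal sketch of penalized kernel change-point detection with a cosine kernel, solved by exact dynamic programming. It is an illustration under stated assumptions, not the authors' implementation: the function names `cosine_gram` and `kcpd_segment` and the penalty parameter `beta` are hypothetical, and this summary does not give the exact form of the paper's penalty, so `beta` is simply a fixed cost per change point.

```python
import numpy as np


def cosine_gram(X):
    """Gram matrix of the cosine kernel for row vectors (e.g. sentence embeddings)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return Xn @ Xn.T


def kcpd_segment(G, beta):
    """Penalized kernel change-point detection by dynamic programming.

    G    : (T, T) kernel Gram matrix of the sequence.
    beta : hypothetical penalty paid per change point (an assumption of this
           sketch; the paper's penalty is not specified in this summary).
    Returns the sorted start indices of every segment after the first.
    """
    T = G.shape[0]
    # 2D prefix sums: S[i, j] = sum of G[:i, :j], so block sums cost O(1).
    S = np.zeros((T + 1, T + 1))
    S[1:, 1:] = G.cumsum(axis=0).cumsum(axis=1)
    # Prefix sums of the diagonal k(x_t, x_t).
    D = np.concatenate(([0.0], np.cumsum(np.diag(G))))

    def cost(s, e):
        # Within-segment scatter in the RKHS for indices s, ..., e-1:
        # sum_t k(x_t, x_t) - (1 / n) * sum_{t, u} k(x_t, x_u)
        block = S[e, e] - S[s, e] - S[e, s] + S[s, s]
        return (D[e] - D[s]) - block / (e - s)

    best = np.full(T + 1, np.inf)
    best[0] = -beta  # the first segment does not pay the penalty
    prev = np.zeros(T + 1, dtype=int)
    for e in range(1, T + 1):
        for s in range(e):
            val = best[s] + cost(s, e) + beta
            if val < best[e]:
                best[e] = val
                prev[e] = s

    # Backtrack through the optimal segment starts.
    cps, e = [], T
    while e > 0:
        s = prev[e]
        if s > 0:
            cps.append(int(s))
        e = s
    return sorted(cps)
```

On a toy input of ten copies of one unit vector followed by ten copies of an orthogonal one, `kcpd_segment(cosine_gram(X), beta=1.0)` returns `[10]`, while a very large `beta` returns no change points. The double loop is quadratic in the sequence length, which is acceptable for a sketch but not tuned for long sequences.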

Paper Structure

This paper contains 28 sections, 13 theorems, 101 equations, 6 figures, 3 tables.

Key Result

Theorem 1

Let Assumptions A1–A5 hold. Then

Figures (6)

  • Figure 1: Segmentation accuracies versus sequence length $T$ for KCPD applied to synthetically generated $m$-dependent text data with GPT-4.1 and $m=20$. Curves compare three embedding methods (sBERT, MPNet, OpenAI). Dashed red line shows the growth of the number of change points $K \approx 2\log T$.
  • Figure 2: Timeline of Taylor Swift’s tweet stream segmented by KCPD using RBF and cosine kernels. Each segment is annotated with its tweet count (blue boxes) and an interpretation of its content (pink boxes).
  • Figure 3: Sensitivity of the detected breakpoints to the parameter $C$ on Taylor Swift’s tweet stream.
  • Figure 4: Sensitivity of $C$ with cosine and RBF kernel.
  • Figure 5: $P_k$ error (%) versus sequence length $T$ for KCPD applied to synthetically generated $m$-dependent text data with GPT-4.1, $m=20$, for multiple values of $C$ and sBERT embeddings.
  • ...and 1 more figure
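The $P_k$ error reported in the figure captions can be sketched as follows, assuming the standard $P_k$ definition from the text segmentation literature (a sliding window of width $k$, set by default to half the mean reference segment length). The helper names `boundaries_to_labels` and `pk`, and the convention that a change point at index $t$ means a new segment starts at position $t$, are assumptions of this sketch rather than details taken from the paper.

```python
def boundaries_to_labels(cps, T):
    """Map change points (segment start indices) to a segment id per position."""
    starts = set(cps)
    labels, seg = [], 0
    for t in range(T):
        if t in starts:
            seg += 1
        labels.append(seg)
    return labels


def pk(ref_cps, hyp_cps, T, k=None):
    """P_k error between a reference and a hypothesized segmentation.

    Counts the windows (i, i + k) on which reference and hypothesis disagree
    about whether the two positions lie in the same segment.
    """
    ref = boundaries_to_labels(ref_cps, T)
    hyp = boundaries_to_labels(hyp_cps, T)
    if k is None:
        # Half the average reference segment length, at least 1.
        k = max(1, round(T / (2 * (len(set(ref_cps)) + 1))))
    n = T - k
    if n <= 0:
        raise ValueError("sequence is too short for window size k")
    errors = 0
    for i in range(n):
        same_ref = ref[i] == ref[i + k]
        same_hyp = hyp[i] == hyp[i + k]
        errors += same_ref != same_hyp
    return errors / n
```

With a length-20 sequence and one true boundary at index 10, a perfect hypothesis gives `pk([10], [10], 20) == 0.0`, and a hypothesis with no boundaries gives 1/3 (5 disagreeing windows out of 15). Multiply by 100 to express the value as a percentage.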

Theorems & Definitions (24)

  • Theorem 1: Consistency in the number of change-points
  • Theorem 2: Weak location consistency
  • Proposition 1: m-dependent concentration for segment cost
  • proof
  • Lemma 3: Uniform deviation over all segments
  • proof
  • Proposition 2: Stability on homogeneous segments
  • proof
  • Lemma 4: Signal strength on a mixed segment
  • proof
  • ...and 14 more