Discovering Elementary Discourse Units in Textual Data Using Canonical Correlation Analysis

Akanksha Mehndiratta, Krishna Asawa

TL;DR

This work introduces a probabilistic, two-view interpretation of Canonical Correlation Analysis (CCA) to discover Elementary Discourse Units (EDUs) in text without supervision. By treating adjacent sentences or dialogue turns as two views, the method learns a shared latent state $L$ and yields EDU representations that can be mapped to words via embeddings, forming a simple, linear, language-independent EDU segmentation framework. Evaluated on the STSB and Mohler datasets, the approach performs competitively with supervised models in low-resource settings, illustrating the potential of unsupervised EDUs for content selection and semantic similarity tasks. The study grounds EDUs theoretically in CCA and demonstrates their practical use in textual similarity, offering a lightweight, adaptable baseline that reduces reliance on large labeled corpora.

Abstract

Canonical Correlation Analysis (CCA) has been widely used for learning latent representations in various fields. This study goes a step further by demonstrating the potential of CCA for identifying Elementary Discourse Units (EDUs) that capture the latent information within textual data. The probabilistic interpretation of CCA discussed in this study exploits the two-view nature of textual data, i.e., consecutive sentences in a document or turns in a dyadic conversation, and has a strong theoretical foundation. Furthermore, this study proposes a model for EDU segmentation that discovers EDUs in textual data without any supervision. To validate the model, the EDUs are used as the textual unit for content selection in a textual similarity task. Empirical results on the Semantic Textual Similarity (STSB) and Mohler datasets confirm that, despite being represented as unigrams, the EDUs deliver competitive results and can even outperform various sophisticated supervised techniques. The model is simple, linear, adaptable, and language-independent, making it an ideal baseline, particularly when labeled training data is scarce or nonexistent.
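The two-view idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes bag-of-words counts as a stand-in for the paper's sentence representations, uses a regularized whitening-plus-SVD form of classical CCA, and ranks words by their loading on the first canonical direction purely as an illustrative proxy for content selection. The toy `document`, the helper names, and the `reg` parameter are all hypothetical.

```python
import numpy as np

def bag_of_words(sentences, vocab):
    """Count matrix with one row per sentence (assumed stand-in representation)."""
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(sentences), len(vocab)))
    for row, sent in enumerate(sentences):
        for tok in sent.lower().split():
            X[row, index[tok]] += 1.0
    return X

def inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def cca(A, B, reg=1e-3):
    """Classical CCA via whitening and SVD; reg keeps the covariances invertible."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    n = A.shape[0]
    Caa = A.T @ A / n + reg * np.eye(A.shape[1])
    Cbb = B.T @ B / n + reg * np.eye(B.shape[1])
    Cab = A.T @ B / n
    Ka, Kb = inv_sqrt(Caa), inv_sqrt(Cbb)
    U, s, Vt = np.linalg.svd(Ka @ Cab @ Kb)
    dim = min(A.shape[1], B.shape[1])  # size of the shared projected space
    return Ka @ U[:, :dim], Kb @ Vt.T[:, :dim], s[:dim]

# Hypothetical toy document: sentence i is view a, sentence i + 1 is view b.
document = [
    "the cat sat on the mat",
    "the cat chased a mouse",
    "a mouse ran under the mat",
    "the dog barked at the cat",
    "the dog sat on the mat",
]
vocab = sorted({tok for s in document for tok in s.lower().split()})
X = bag_of_words(document, vocab)
view_a, view_b = X[:-1], X[1:]

Wa, Wb, corr = cca(view_a, view_b)

# Illustrative proxy only: words with the largest loading on the top direction.
top = np.argsort(-np.abs(Wa[:, 0]))[:3]
print([vocab[i] for i in top])
```

With so few sentence pairs the canonical correlations are dominated by the regularizer, so the printed words only demonstrate the mechanics; a realistic corpus and real embeddings would be needed for meaningful units.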

Paper Structure

This paper contains 17 sections, 2 theorems, 21 equations, 2 figures, 4 tables, 2 algorithms.

Key Result

Lemma 1

We consider the model given in figure fig:HSM1, where $a$ and $b$ are two random variables of dimensions $m$ and $n$ respectively, defined by eq:5. Here $L \in \mathbb{R}^{dim}$ is the shared hidden state, $I$ is an identity matrix, and $dim = \min(m, n)$ is the dimension of the space onto which the views are projected. The maximum likelihood estimates of the model parameters $W_a \in \mathbb{R}^{m \times dim}$, W
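
For orientation, the setup in this lemma resembles the standard probabilistic formulation of CCA. Assuming the paper's eq:5 follows that formulation, which the truncated statement here does not confirm, the generative model could be written as below. The second loading matrix $W_b$, the means $\mu_a, \mu_b$, and the noise covariances $\Psi_a, \Psi_b$ are symbols assumed for this sketch and may differ from the paper's notation.

```latex
\begin{align}
L &\sim \mathcal{N}(0, I), \qquad L \in \mathbb{R}^{dim},\quad dim = \min(m, n) \\
a \mid L &\sim \mathcal{N}(W_a L + \mu_a,\ \Psi_a), \qquad W_a \in \mathbb{R}^{m \times dim},\ \Psi_a \succeq 0 \\
b \mid L &\sim \mathcal{N}(W_b L + \mu_b,\ \Psi_b), \qquad W_b \in \mathbb{R}^{n \times dim},\ \Psi_b \succeq 0
\end{align}
```

Under this reading, the two views $a$ and $b$ are conditionally independent given the shared state $L$, which is what lets adjacent sentences or dialogue turns be treated as two noisy observations of the same latent content.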

Figures (2)

  • Figure 1: Graphical model for Multi-view assumption adapted from foster2008multi
  • Figure 2: Graphical model for Latent Interpretation Model adapted from article

Theorems & Definitions (6)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • proof
  • proof