PMIScore: An Unsupervised Approach to Quantify Dialogue Engagement

Yongkang Guo, Zhihuan Huang, Yuqing Kong

Abstract

High dialogue engagement is a crucial indicator of an effective conversation. A reliable measure of engagement could help benchmark large language models, enhance the effectiveness of human-computer interactions, or improve personal communication skills. However, quantifying engagement is challenging, since it is subjective and lacks a "gold standard". This paper proposes PMIScore, an efficient unsupervised approach to quantify dialogue engagement. It uses pointwise mutual information (PMI), which compares the probability of generating a response conditioned on the conversation history with the probability of that response on its own. Thus, PMIScore offers a clear interpretation of engagement. As directly computing PMI is intractable due to the complexity of dialogues, PMIScore learns it through a dual form of divergence. The algorithm consists of generating positive and negative dialogue pairs, extracting embeddings with large language models (LLMs), and training a small neural network using a mutual information loss function. We validate PMIScore on both synthetic and real-world datasets. Our results demonstrate the effectiveness of PMIScore in PMI estimation and the reasonableness of the PMI metric itself.
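
The idea of learning PMI through a dual form can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes the KL case of the $f$-divergence dual form, replaces the LLM embeddings and the small neural network with a lookup table over discrete (history, response) pairs, and builds negative pairs by shuffling responses. The function name `estimate_pmi_tabular` and all parameters are hypothetical.

```python
import numpy as np

def estimate_pmi_tabular(x, y, n_x, n_y, steps=2000, lr=1.0, seed=0):
    """Estimate PMI for discrete pairs by maximizing the KL dual objective
    E_pos[T] - E_neg[exp(T - 1)] over a tabular critic T.

    The table T is an assumed stand-in for a neural network on embeddings.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    # Positive pairs: each history with the response actually observed.
    pos = np.zeros((n_x, n_y))
    np.add.at(pos, (x, y), 1.0 / n)
    # Negative pairs: responses shuffled across histories, which
    # approximates sampling from the product of the marginals.
    neg = np.zeros((n_x, n_y))
    np.add.at(neg, (x, rng.permutation(y)), 1.0 / n)
    T = np.zeros((n_x, n_y))
    for _ in range(steps):
        # Gradient ascent step on the dual objective, one entry per cell.
        T += lr * (pos - neg * np.exp(T - 1.0))
    # At the optimum T = 1 + log(p(x, y) / (p(x) p(y))), so PMI = T - 1.
    return T - 1.0

# Toy joint distribution: matching pairs (0, 0) and (1, 1) are more likely.
joint = np.array([[0.4, 0.1], [0.1, 0.4]])
rng = np.random.default_rng(1)
flat = rng.choice(4, size=20000, p=joint.ravel())
x, y = np.divmod(flat, 2)  # row index = history, column index = response

est = estimate_pmi_tabular(x, y, 2, 2)
true_pmi = np.log(joint / np.outer(joint.sum(axis=1), joint.sum(axis=0)))
print(np.round(est, 2))
print(np.round(true_pmi, 2))
```

On this toy data the learned table should land close to the true PMI, positive for the matching pairs and negative for the mismatched ones. In a real dialogue setting the table would not scale, which is where a parametric scorer on embeddings becomes necessary.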

Paper Structure (68 sections, 7 theorems, 26 equations, 4 figures, 3 tables)


Key Result

Theorem 5.3

Under the assumptions described above, as the number of samples $n \to \infty$, the PMIScore converges to the pointwise mutual information:
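
The theorem's displayed equation is not reproduced in this summary. For orientation only, the standard definition of PMI and a schematic form of the convergence claim can be written as below. The symbols $x$ (conversation history), $y$ (response), and $\widehat{s}_n$ (the PMIScore obtained from $n$ samples) are assumed notation rather than the paper's, and the mode of convergence is left unspecified because it is not stated here.

$$
\mathrm{PMI}(x; y) = \log \frac{p(x, y)}{p(x)\,p(y)} = \log \frac{p(y \mid x)}{p(y)},
\qquad
\widehat{s}_n(x, y) \longrightarrow \mathrm{PMI}(x; y) \quad \text{as } n \to \infty .
$$

As a worked instance of the definition: if a response has probability $p(y \mid x) = 0.2$ given the history but only $p(y) = 0.05$ overall, then $\mathrm{PMI}(x; y) = \log(0.2 / 0.05) = \log 4 \approx 1.386$ (natural logarithm), a positive value indicating that the response is specific to that history.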

Figures (4)

  • Figure 1: Flow of PMIScore.
  • Figure 2: PMIScore vs. baselines on synthetic distributions. The figure shows the Spearman correlation (left) and mean squared error (right) of different methods (PMIScore, MINE, InfoNCE, and KDE) on three synthetic dependency structures (Diagonal, Block, and Independent). Higher Spearman values indicate better alignment with the true PMI ranking, and lower MSE indicates a closer approximation of the true PMI. PMIScore achieves the highest rank correlation and the lowest MSE in almost all settings.
  • Figure 3: Consistency of PMI estimation for different methods. Each panel compares the estimated and ground-truth pointwise mutual information (PMI) for one of four estimation methods (PMIScore, MINE, InfoNCE, and KDE) on the Block synthetic dataset with Qwen3-4B embeddings. To illustrate the magnitude of estimation error, 1,000 representative samples are plotted in each panel. The x-axis is the ground-truth PMI and the y-axis is the estimated value. The dashed line is the identity line, where the estimated PMI perfectly matches the ground truth. PMIScore aligns closely with the identity line, indicating minimal estimation bias, whereas the alternative estimators display noticeable deviations or slope distortions. Similar patterns are observed across other embedding models and synthetic datasets.
  • Figure 4: PMIScore vs. baselines on empirical dialogue datasets. Bars show the mean across different LLMs for English (dstc_en) and Chinese (dstc_zh). Left: ROC-AUC on the val/test sets for the response-ranking task. Right: Spearman correlation with human relevance on the dev sets. A higher ROC-AUC indicates that the score better distinguishes positive from negative pairs. PMIScore closely tracks InfoNCE on AUC and yields the best human correlation on English while remaining competitive on Chinese; both substantially outperform KDE.
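
The three metrics named in the captions above (Spearman correlation, mean squared error, and ROC-AUC) can be computed as in the following minimal sketch. It is an illustration of the metric definitions, not the authors' evaluation code, and the helper names `average_ranks`, `spearman`, `mse`, and `roc_auc` are hypothetical.

```python
import numpy as np

def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    r = np.empty(len(values))
    r[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        r[tied] = r[tied].mean()
    return r

def spearman(a, b):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

def mse(estimate, truth):
    """Mean squared error between estimated and true values."""
    diff = np.asarray(estimate, dtype=float) - np.asarray(truth, dtype=float)
    return float(np.mean(diff ** 2))

def roc_auc(scores, labels):
    """ROC-AUC via the rank-sum (Mann-Whitney U) identity.

    Equals the probability that a random positive pair scores higher
    than a random negative pair, counting ties as one half.
    """
    labels = np.asarray(labels, dtype=bool)
    r = average_ranks(scores)
    n_pos = int(labels.sum())
    n_neg = int((~labels).sum())
    u = r[labels].sum() - n_pos * (n_pos + 1) / 2
    return float(u / (n_pos * n_neg))

# Hypothetical scores: estimated vs. true PMI, and positive/negative labels.
estimated = [1.2, -0.4, 0.3, 2.0]
truth = [1.0, -0.5, 0.2, 1.5]
labels = [1, 0, 0, 1]
print(spearman(estimated, truth), mse(estimated, truth), roc_auc(estimated, labels))
```

Spearman correlation only checks whether the ranking is preserved, so an estimator can score well on it while still being biased in magnitude; MSE is the metric that penalizes such bias.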

Theorems & Definitions (15)

  • Definition 3.1: Mutual Information
  • Definition 3.2: Pointwise Mutual Information
  • Definition 3.3: Dual Form of Mutual Information
  • Theorem 5.3
  • Remark 5.4
  • Definition A.1: $f$-Mutual Information [kong2019information]
  • Lemma 1: Properties of $f$-Mutual Information [kong2019information]
  • Definition A.2: Fenchel Duality [rockafellar2015convex]
  • Lemma 2: Dual Form of $f$-Divergence [nguyen2010estimating]
  • Definition A.3: $f$-Divergence [ali1966general]
  • ...and 5 more