
Improving the TENOR of Labeling: Re-evaluating Topic Models for Content Analysis

Zongxia Li, Andrew Mao, Daniel Stephens, Pranav Goel, Emily Walpole, Alden Dima, Juan Fung, Jordan Boyd-Graber

TL;DR

Current automated metrics do not provide a complete picture of topic modeling capabilities, but the right choice of neural topic model (NTM) can outperform classical models on practical tasks: the Contextual Neural Topic Model does best on cluster evaluation metrics and human evaluations.

Abstract

Topic models are a popular tool for understanding text collections, but their evaluation has been a point of contention. Automated evaluation metrics such as coherence are often used; however, their validity has been questioned for neural topic models (NTMs), and they can overlook a model's benefits in real-world applications. To this end, we conduct the first evaluation of neural, supervised, and classical topic models in an interactive, task-based setting. We combine topic models with a classifier and test their ability to help humans conduct content analysis and document annotation. In simulated, real-user, and expert pilot studies, the Contextual Neural Topic Model does the best on cluster evaluation metrics and human evaluations; however, LDA is competitive with two other NTMs under our simulated experiment and user study results, contrary to what coherence scores suggest. We show that current automated metrics do not provide a complete picture of topic modeling capabilities, but the right choice of NTMs can be better than classical models on practical tasks.

Paper Structure (54 sections, 9 equations, 7 figures, 5 tables)


Figures (7)

  • Figure 1: Cluster scores of simulated labeling experiments, median of 15 runs. CTM with active learning has the highest score across all metrics and datasets. LDA and sLDA are better than or competitive with the other NTMs (ETM, BERTopic). Given these results on synthetic data, we use CTM for the human experiments.
  • Figure 2: User study label cluster metrics plotted against time. For each group, we take the median of each metric for every minute passed. The user study results are similar to the simulated experiment; CTM does the best on all three clustering metrics.
  • Figure 3: The first plot shows NPMI coherence for all topics on the Bills dataset, where sLDA(user) is trained on user input labels and sLDA is the initial model used for all sLDA users. The remaining plots show users' ratings on different questions on a scale of 1 to 7, where higher is better. Although sLDA is worse than LDA and CTM on clustering evaluations, most of its median user ratings do not differ from those of CTM, and it surpasses LDA on some ratings. For ratings 2 to 4, users in the none group all rate 0 because they do not have access to those features.
  • Figure 4: We run a follow-up pilot study with six social science experts (three in each group) on their internal social science dataset (800 documents). They are familiar with the topics in the dataset. Up to the 50th document labeled, CTM still generalizes well for expert datasets and expert users.
  • Figure 5: Interface overview for (1) the none group. Users are not presented with a topic overview, but the active learning classifier picks a document based on the preference function and places it at the top of the page.
  • ...and 2 more figures