Studying the Effects of Collaboration in Interactive Theme Discovery Systems

Alvin Po-Chun Chen, Rohan Das, Dananjay Srinivas, Alexandra Barry, Maksim Seniw, Maria Leonor Pacheco

TL;DR

This work addresses the lack of standardized evaluation for NLP-assisted qualitative coding by proposing a framework that assesses consistency, cohesiveness, and correctness across synchronous and asynchronous collaboration. It experimentally compares three diverse interactive tools (topic-model-based, relational, and LLM-based) on a large COVID-19 vaccine tweet dataset. Key findings show that collaboration modality markedly influences output quality for some tools, with synchronous deliberation boosting consistency and cohesion, while LLM-based approaches raise concerns about scalability and reliability. The paper provides actionable recommendations and a generalizable evaluation framework to guide robust, real-world assessments of human-in-the-loop (HitL) qualitative coding tools.

Abstract

NLP-assisted solutions have gained considerable traction to support qualitative data analysis. However, there does not exist a unified evaluation framework that can account for the many different settings in which qualitative researchers may employ them. In this paper, we take a first step in this direction by proposing an evaluation framework to study the way in which different tools may result in different outcomes depending on the collaboration strategy employed. Specifically, we study the impact of synchronous vs. asynchronous collaboration using two different NLP-assisted qualitative research tools and present a comprehensive analysis of significant differences in the consistency, cohesiveness, and correctness of their outputs.

Paper Structure (18 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: In this study, we measure the quality of coded themes using different interactive systems under different coding configurations.
  • Figure 2: Two sets of annotators use a particular HiTL system to find themes. Since the same theme can be named differently by different annotators, we find the best match. In this example, annotator 1's theme "VaxIsBad" has been matched with annotator 2's theme "antivax". After aligning, we calculate the similarity between these two themes using methods like Jaccard Similarity or Centroid Distance.
  • Figure 3: Once an annotator has identified themes and they have been propagated to the full dataset, we calculate intra-theme similarity by measuring the average of the pairwise distances between each document within a theme (left). We calculate inter-theme similarity by measuring the average of pairwise distances between each document in a theme and documents assigned to all other themes (right).
  • Figure 4: Correctness w.r.t. distance from theme.
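
The measures described in the captions of Figures 2 and 3 can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the greedy best-match rule, and the use of Euclidean distance over document embeddings are all assumptions made for illustration, since the captions do not specify how the best match is found or which distance is used.

```python
import math
from itertools import combinations


def jaccard(docs_a, docs_b):
    """Jaccard similarity between two sets of document IDs."""
    a, b = set(docs_a), set(docs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align_themes(themes_a, themes_b):
    """Match each theme of annotator A to its most similar theme of annotator B.

    Assumption: a greedy per-theme best match by Jaccard similarity.
    The source only says "we find the best match".
    """
    matches = {}
    for name_a, docs_a in themes_a.items():
        best = max(themes_b, key=lambda name_b: jaccard(docs_a, themes_b[name_b]))
        matches[name_a] = (best, jaccard(docs_a, themes_b[best]))
    return matches


def intra_theme_distance(vectors):
    """Average pairwise distance between documents within one theme.

    Assumption: documents are embedding vectors compared with Euclidean distance.
    """
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(u, v) for u, v in pairs) / len(pairs)


def inter_theme_distance(theme_vectors, other_vectors):
    """Average distance between a theme's documents and documents of all other themes."""
    n = len(theme_vectors) * len(other_vectors)
    if n == 0:
        return 0.0
    total = sum(math.dist(u, v) for u in theme_vectors for v in other_vectors)
    return total / n


# Hypothetical toy data: theme name -> set of document IDs.
annotator_1 = {"VaxIsBad": {1, 2, 3, 4}}
annotator_2 = {"antivax": {2, 3, 4, 5}, "provax": {6, 7}}
alignment = align_themes(annotator_1, annotator_2)
```

Under this sketch, a cohesive theme would show a low intra-theme distance relative to its inter-theme distance. A one-to-one assignment (for example, an optimal bipartite matching) could replace the greedy rule if two themes must not share the same match.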