Large Language Models for Automatic Milestone Detection in Group Discussions

Zhuoxu Duan, Zhengye Yang, Samuel Westby, Christoph Riedl, Brooke Foucault Welles, Richard J. Radke

TL;DR

The paper tackles automatic milestone detection in group discussions from noisy transcripts, motivated by the impracticality of manual annotation. It compares a BERT-based semantic similarity baseline with an iterative GPT-4 prompting approach on a puzzle task whose milestones can be achieved in any order, using transcripts from 20 groups. GPT-based prompting achieves high milestone accuracy but shows problems with non-determinism, output formatting, and hallucinations, and involves context-window trade-offs. The work points to real-time meeting analytics as a potential application and informs prompt design for long, multi-speaker transcripts, while noting ethical considerations for processing live communications and, in future work, other multimodal signals.

Abstract

Large language models like GPT have proven widely successful on natural language understanding tasks based on written text documents. In this paper, we investigate an LLM's performance on recordings of a group oral communication task in which utterances are often truncated or not well-formed. We propose a new group task experiment involving a puzzle with several milestones that can be achieved in any order. We investigate methods for processing transcripts to detect if, when, and by whom a milestone has been completed. We demonstrate that iteratively prompting GPT with transcription chunks outperforms semantic similarity search methods using text embeddings, and further discuss the quality and randomness of GPT responses under different context window sizes.
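The iterative prompting idea in the abstract can be sketched as a loop that feeds one transcript chunk at a time to a model and keeps a running answer per milestone. This is a minimal sketch, not the authors' implementation: `chunk_transcript`, `detect_milestones`, the `query_model` callable, the default chunk size, and the overwrite-on-new-proposal update rule are all assumptions made for illustration. In practice `query_model` would wrap an LLM call; here it is left as a plain function argument so the loop can run without any API.

```python
from typing import Callable, Dict, List, Optional

# Running answer per milestone: the best-matching sentence so far, or None.
Milestones = Dict[str, Optional[str]]

# Hypothetical signature for the model call: it receives the current chunk and
# the current running answers, and returns proposed updates per milestone.
QueryFn = Callable[[List[str], Milestones], Milestones]


def chunk_transcript(utterances: List[str], chunk_size: int) -> List[List[str]]:
    """Split the transcript into consecutive chunks of at most chunk_size lines."""
    return [utterances[i:i + chunk_size]
            for i in range(0, len(utterances), chunk_size)]


def detect_milestones(utterances: List[str],
                      milestone_names: List[str],
                      query_model: QueryFn,
                      chunk_size: int = 50) -> Milestones:
    """Query the model once per chunk and merge its proposals into the state."""
    state: Milestones = {name: None for name in milestone_names}
    for chunk in chunk_transcript(utterances, chunk_size):
        # Pass a copy so the model function cannot mutate the running state.
        proposal = query_model(chunk, dict(state))
        for name, sentence in proposal.items():
            # Assumed update rule: a non-empty proposal for a known milestone
            # replaces the previous answer.
            if name in state and sentence is not None:
                state[name] = sentence
    return state


def keyword_stub(chunk: List[str], state: Milestones) -> Milestones:
    """Stand-in for an LLM: propose the first line mentioning an unsolved milestone."""
    updates: Milestones = {}
    for name, current in state.items():
        if current is None:
            for line in chunk:
                if name in line.lower():
                    updates[name] = line
                    break
    return updates
```

The chunk size plays the role of the context-window setting discussed in the abstract: larger chunks mean fewer queries but more text per query. The keyword stub exists only to make the loop testable and says nothing about how GPT actually decides.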

Paper Structure (15 sections, 1 equation, 5 figures, 6 tables)

Figures (5)

  • Figure 1: A subset of the Cursed Treasure puzzle task.
  • Figure 2: An example of our milestone detection prompt to ChatGPT (not including the full transcription of the group meeting in this figure). In the structured output, "one, dual, quadruple, octopus, hex" are the five milestones for the individual curses, and "solution" is the final answer. In this example, ChatGPT accurately found all of the correct milestone sentences.
  • Figure 3: Flowchart of the proposed iterative query. The puzzle information, request, and update rule are fixed in each iteration. We use a new section of the transcriptions in each loop until all transcriptions are processed. Blue characters in the response window mean GPT has found a better match for that milestone.
  • Figure 4: Frequency histogram of cosine similarity for "octopus". The x-axis is the score, ranging from 0 to 1, and the y-axis is the log-scaled frequency. Relevant (milestone) and irrelevant (background) sentences are shown in blue and yellow, respectively.
  • Figure 5: Confusion matrix of the proposed evaluation method. Each team is labeled as failed (0) or solved (1) for each milestone, which yields the team-level true negatives, false negatives, and false positives. When both the prediction and the ground truth say a team has solved a milestone, the proposed sentences must be checked to determine whether they are the true answers. If they are not, the case is counted as a sentence-level false positive.
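
The two-level evaluation described in the Figure 5 caption can be illustrated with a small classifier over a single (team, milestone) pair. This is a sketch under stated assumptions: the function name, the category labels, and the use of exact string membership to decide whether a proposed sentence is a true answer are all hypothetical, and the paper may match sentences differently.

```python
from collections import Counter
from typing import Iterable, Optional, Set, Tuple


def classify_prediction(solved_truth: bool,
                        predicted_sentence: Optional[str],
                        true_sentences: Set[str]) -> str:
    """Assign one (team, milestone) pair to a confusion-matrix category."""
    # A prediction counts as "solved" when the model proposed any sentence.
    predicted_solved = predicted_sentence is not None
    if not solved_truth and not predicted_solved:
        return "team_TN"
    if solved_truth and not predicted_solved:
        return "team_FN"
    if not solved_truth and predicted_solved:
        return "team_FP"
    # Both say "solved": check the proposed sentence against the annotations.
    # Assumption: exact string match stands in for the sentence check.
    if predicted_sentence in true_sentences:
        return "TP"
    return "sentence_FP"


def tally(cases: Iterable[Tuple[bool, Optional[str], Set[str]]]) -> Counter:
    """Count categories over many (truth, prediction, annotations) triples."""
    return Counter(classify_prediction(*case) for case in cases)
```

Separating `team_FP` from `sentence_FP` keeps two different failure modes apart: claiming a milestone that was never reached, versus finding the right milestone but pointing at the wrong sentence.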