Large Language Models for Automatic Milestone Detection in Group Discussions
Zhuoxu Duan, Zhengye Yang, Samuel Westby, Christoph Riedl, Brooke Foucault Welles, Richard J. Radke
TL;DR
The paper addresses automatic milestone detection in group discussions from noisy transcripts, a task for which manual annotation is impractical. Using a puzzle task whose milestones can be achieved in any order, performed by 20 groups, it compares a BERT-based semantic similarity baseline with an iterative GPT-4 prompting approach. GPT-based prompting can achieve high milestone accuracy, but the study also reveals non-determinism, formatting errors, and hallucinations, along with trade-offs in the choice of context window size. The work points to potential for real-time meeting analytics and informs prompt design for long, multi-speaker transcripts, while highlighting ethical considerations in processing live communications and, in future work, other multimodal signals.
Abstract
Large language models such as GPT have proven widely successful on natural language understanding tasks based on written text documents. In this paper, we investigate an LLM's performance on recordings of a group oral communication task in which utterances are often truncated or not well-formed. We propose a new group task experiment involving a puzzle with several milestones that can be achieved in any order. We investigate methods for processing transcripts to detect if, when, and by whom a milestone has been completed. We demonstrate that iteratively prompting GPT with transcript chunks outperforms semantic similarity search methods using text embeddings, and we further discuss the quality and randomness of GPT responses under different context window sizes.
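To make the idea of iteratively prompting with transcript chunks concrete, the following is a minimal sketch and not the authors' implementation. Every name in it (`chunk_transcript`, `detect_milestones`, `keyword_stub`), the overlapping-window scheme, and the toy utterances and milestones are assumptions made purely for illustration. The model call is replaced by a keyword-matching stand-in so that the example runs without any external service; in practice that callable would wrap a prompt to an LLM.

```python
from typing import Callable, Dict, List, Tuple

# (start time in seconds, speaker, text); this layout is an assumption.
Utterance = Tuple[float, str, str]


def chunk_transcript(utterances: List[Utterance], size: int,
                     overlap: int) -> List[List[Utterance]]:
    """Split utterances into windows of `size` that share `overlap` items."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(utterances), step):
        chunks.append(utterances[start:start + size])
        if start + size >= len(utterances):
            break  # the last window already reaches the end
    return chunks


def detect_milestones(
    chunks: List[List[Utterance]],
    milestones: List[str],
    ask_model: Callable[[str, List[str]], Dict[str, int]],
) -> Dict[str, Tuple[float, str]]:
    """Return {milestone: (time, speaker)} for the first chunk reporting it."""
    found: Dict[str, Tuple[float, str]] = {}
    for chunk in chunks:
        pending = [m for m in milestones if m not in found]
        if not pending:
            break
        text = "\n".join(f"[{t:.0f}s] {who}: {said}" for t, who, said in chunk)
        # ask_model maps each detected milestone to a line index in the chunk.
        for milestone, index in ask_model(text, pending).items():
            # Discard answers that name unknown milestones or invalid lines.
            if milestone in pending and 0 <= index < len(chunk):
                t, who, _ = chunk[index]
                found[milestone] = (t, who)
    return found


def keyword_stub(chunk_text: str, pending: List[str]) -> Dict[str, int]:
    """Hypothetical stand-in for an LLM call: plain substring matching."""
    lines = chunk_text.lower().split("\n")
    hits: Dict[str, int] = {}
    for milestone in pending:
        for i, line in enumerate(lines):
            if milestone.lower() in line:
                hits[milestone] = i
                break
    return hits


# Invented toy data, not taken from the study.
utterances = [
    (3.0, "A", "um so where do we"),
    (7.0, "B", "I think the red key opens it"),
    (12.0, "A", "yeah okay"),
    (20.0, "C", "wait the blue door is"),
    (26.0, "B", "got it"),
]
chunks = chunk_transcript(utterances, size=3, overlap=1)
result = detect_milestones(chunks, ["red key", "blue door"], keyword_stub)
print(result)
```

The window size plays the role of the context window discussed in the abstract: larger chunks give the model more surrounding conversation per prompt, while smaller chunks localize when and by whom a milestone was reached. The validity check on each returned answer is one simple way to guard against malformed or hallucinated output.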
