Topic-Conversation Relevance (TCR) Dataset and Benchmarks

Yaran Fan, Jamie Pool, Senja Filipi, Ross Cutler

TL;DR

A comprehensive Topic-Conversation Relevance (TCR) dataset covering a variety of domains and meeting styles, with benchmarks created using GPT-4 to evaluate model accuracy in understanding transcription-topic relevance.

Abstract

Workplace meetings are vital to organizational collaboration, yet a large percentage of meetings are rated as ineffective. To help improve meeting effectiveness by determining whether the conversation is on topic, we create a comprehensive Topic-Conversation Relevance (TCR) dataset that covers a variety of domains and meeting styles. The TCR dataset includes 1,500 unique meetings, 22 million words in transcripts, and over 15,000 meeting topics, sourced from both newly collected Speech Interruption Meeting (SIM) data and existing public datasets. Along with the text data, we also open-source scripts that generate synthetic meetings or create augmented meetings from the TCR dataset to enhance data diversity. For each data source, benchmarks are created using GPT-4 to evaluate model accuracy in understanding transcription-topic relevance.

Paper Structure

This paper contains 24 sections, 2 figures, and 12 tables.

Figures (2)

  • Figure 1: Topic-Conversation Relevance (TCR) Dataset Schema
  • Figure 2: Metadata Schema for Augmented Meetings