Can AI agents understand spoken conversations about data visualizations in online meetings?

Rizul Sharma, Tianyu Jiang, Seokki Lee, Jillian Aurisano

TL;DR

The paper asks whether AI agents can understand spoken conversations about data visualizations in online meetings. It introduces a dual-axis evaluation framework and a corpus of 72 dialogues with 318 benchmark questions to assess comprehension across LLM and VLM pipelines and different visualization input formats. The text-only LLM pipeline achieves the highest accuracy (about 96%), while the VLM pipelines perform worse (72.4% for text-only VLM, 70% for image-only, 68% for hybrid). The results highlight the importance of text-based visualization metadata for effective AI-assisted meeting support and inform the design of future AI meeting tools.

Abstract

In this short paper, we present work evaluating an AI agent's understanding of spoken conversations about data visualizations in an online meeting scenario. There is growing interest in the development of AI assistants that support meetings, for example by assisting with tasks or summarizing a discussion. The quality of this support depends on a model that understands the conversational dialogue. To evaluate this understanding, we introduce a dual-axis testing framework for diagnosing the AI agent's comprehension of spoken conversations about data. Using this framework, we designed a series of tests to evaluate understanding of a novel corpus of 72 spoken conversational dialogues about data visualizations. We examine diverse pipelines and model architectures (LLM vs. VLM) and diverse input formats for visualizations (the chart image, its underlying source code, or a hybrid of both) to see how these choices affect model performance on our tests. Using our evaluation methods, we found that text-only input modalities achieved the best performance (96%) in understanding discussions of visualizations in online meetings.
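To make the idea of dual-axis scoring concrete, the following is a minimal sketch of how per-question results might be aggregated along two axes for each pipeline. It assumes the two axes are the complexity level and the topic tag named in the Figure 2 caption; the record layout, the label values ("simple", "trend", and so on), the function name `accuracy_by`, and the sample rows are all hypothetical and are not taken from the paper's data or code.

```python
from collections import defaultdict
from typing import NamedTuple


class GradedAnswer(NamedTuple):
    """One benchmark question answered by one pipeline (hypothetical layout)."""
    pipeline: str      # e.g. "text-only LLM", "image-only VLM"
    complexity: str    # assumed first axis: complexity level
    topic: str         # assumed second axis: topic tag
    correct: bool      # whether the answer was judged correct


def accuracy_by(results: list[GradedAnswer], axis: str) -> dict[tuple[str, str], float]:
    """Return accuracy per (pipeline, axis value), where axis is 'complexity' or 'topic'."""
    counts = defaultdict(lambda: [0, 0])  # key -> [number correct, number asked]
    for r in results:
        key = (r.pipeline, getattr(r, axis))
        counts[key][0] += int(r.correct)
        counts[key][1] += 1
    return {key: right / total for key, (right, total) in counts.items()}


# Illustrative rows only; these are made-up values, not results from the paper.
results = [
    GradedAnswer("text-only LLM", "simple", "trend", True),
    GradedAnswer("text-only LLM", "simple", "comparison", True),
    GradedAnswer("text-only LLM", "complex", "trend", False),
    GradedAnswer("image-only VLM", "simple", "trend", True),
    GradedAnswer("image-only VLM", "complex", "trend", False),
    GradedAnswer("image-only VLM", "complex", "comparison", False),
]

by_complexity = accuracy_by(results, "complexity")
by_topic = accuracy_by(results, "topic")
```

Each resulting dictionary corresponds to one heatmap-style table: rows are pipelines, columns are the values on one axis, and each cell is the fraction of questions answered correctly.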

Paper Structure

This paper contains 11 sections, 2 figures.

Figures (2)

  • Figure 1: High-level flow for pipeline operations: (A) Inputs (B) Four Pipeline Modes (C) Standardized Comprehension Tasks
  • Figure 2: (top) Heatmap - Performance on Complexity Levels. (bottom) Heatmap - Performance on Topic Tags