One Agent Too Many: User Perspectives on Approaches to Multi-agent Conversational AI

Christopher Clarke, Karthik Krishnamurthy, Walter Talamonti, Yiping Kang, Lingjia Tang, Jason Mars

TL;DR

This study compares two interaction paradigms for multi-agent conversational AI: orchestration abstracted behind a single unified agent (One For All) and user-driven agent selection (Agent Select). Through two prototypes evaluated by 19 participants across 10 domains, the authors show that abstracted orchestration yields superior system usability and task performance, with One For All's responses rated within 1% of human-selected answers. The work provides end-to-end implementations and two crowdsourced datasets to enable further exploration of multi-agent conversational interfaces. It highlights practical design considerations such as avoiding undesirable agent responses, maintaining modular agnosticism, and handling domain overlap. Overall, the findings advocate for orchestration-first approaches to simplify cross-domain task completion while preserving the option for user choice when needed.

Abstract

Conversational agents have been gaining popularity in recent years. Influenced by the widespread adoption of task-oriented agents such as Apple Siri and Amazon Alexa, these agents are being deployed into various applications to enhance user experience. Although these agents promote "ask me anything" functionality, they are typically built to focus on a single area or a finite set of areas of expertise. Given that complex tasks often require more than one area of expertise, users need to learn and adopt multiple agents. One approach to alleviate this is to abstract the orchestration of agents in the background. However, this removes the option of choice and flexibility, potentially harming the ability to complete tasks. In this paper, we explore these two interaction experiences, one agent for all versus user choice of agents, for conversational AI. We design prototypes for each, systematically evaluating their ability to facilitate task completion. Through a series of user studies, we show that users have a significant preference for abstracting agent orchestration in both system usability and system performance. Additionally, we demonstrate that this mode of interaction is able to provide quality responses that are rated within 1% of human-selected answers.

Paper Structure (54 sections, 5 figures, 4 tables, 1 algorithm)

Figures (5)

  • Figure 1: Overview of One For All. The user's voice is captured via a microphone and then transcribed into its text equivalent. The textual query is then passed to each of the conversational assistants in parallel. When each agent provides a response, all responses and the input query are fed to the ranking engine, which then embeds them and calculates a semantic relation between the user's query and agent responses. The response with the highest ranking is returned, then converted to audio and played to the user.
  • Figure 2: Agent Select Prototype Interface. In contrast to One For All, users are given a choice over the agent they wish to route their query to.
  • Figure 3: The average scores of the participants' feedback across statements in the questionnaire (higher is better). Statements S1-S5 correspond to those listed in the paper's questionnaire table. All results are significant using the Wilcoxon Signed Rank test ($p < .01$).
  • Figure 4: An MTurk task assignment example. We asked workers to decide which of the candidate responses was the most appropriate for the question/command stated. This setup allowed us to gather human judgments of the most appropriate responses to inquiries, and to measure how effective our approach is at deciding on the best responses.
  • Figure 5: The distribution of agent response quality across Likert scale data points. One For All outperforms each of the assistants in isolation when producing desirable responses, and it produces fewer responses deemed completely wrong by the crowd than any individual assistant. In addition, when compared with the human-judgment ground truth, One For All is on par in selecting the most appropriate agent.
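
The two interaction modes described in the Figure 1 and Figure 2 captions can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the speech-to-text and text-to-speech steps are omitted, the three agents and their canned replies are hypothetical, and a toy bag-of-words cosine similarity stands in for whatever embedding model the paper's ranking engine actually uses.

```python
import math
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple

Agent = Callable[[str], str]  # an agent maps a text query to a text response


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption: stand-in for a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def one_for_all(query: str, agents: Dict[str, Agent]) -> Tuple[str, str]:
    """Send the query to every agent in parallel, rank the responses by
    similarity to the query, and return (agent name, top response)."""
    if not agents:
        raise ValueError("at least one agent is required")
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = {name: pool.submit(agent, query) for name, agent in agents.items()}
        responses = {name: fut.result() for name, fut in futures.items()}
    query_vec = embed(query)
    best = max(responses, key=lambda name: cosine(query_vec, embed(responses[name])))
    return best, responses[best]


def agent_select(query: str, agents: Dict[str, Agent], chosen: str) -> str:
    """Route the query only to the agent the user picked."""
    return agents[chosen](query)


# Hypothetical agents with canned replies, for illustration only.
def weather_agent(query: str) -> str:
    return "The weather in Detroit today is sunny with a high of 75."


def music_agent(query: str) -> str:
    return "Sorry, I can only play songs."


def finance_agent(query: str) -> str:
    return "Sorry, I do not know about that."


AGENTS: Dict[str, Agent] = {
    "weather": weather_agent,
    "music": music_agent,
    "finance": finance_agent,
}

if __name__ == "__main__":
    question = "What is the weather in Detroit today?"
    print(one_for_all(question, AGENTS))
    print(agent_select(question, AGENTS, "music"))
```

In this sketch, Agent Select can return an unhelpful reply when the user picks an agent outside the query's domain, whereas One For All lets the ranking step filter such replies out. That difference is the trade-off the study evaluates.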