A Multi-Agent Conversational Bandit Approach to Online Evaluation and Selection of User-Aligned LLM Responses

Xiangxiang Dai; Yuejin Xie; Maoli Liu; Xuchuang Wang; Zhuohua Li; Huanyu Wang; John C. S. Lui

TL;DR

MACO addresses online evaluation and selection of LLM responses that align with user preferences in a distributed, multi-device setting. It introduces two components: MACO-A, which runs online elimination on local agents, and MACO-S, which runs on the cloud server and adapts selection strategies from aggregated preference data, using adaptive key-term conversations to speed up preference learning. Theoretical results show near-minimax regret $R_M(T)=\mathcal{O}(\sqrt{dMT\log \frac{AM\log T}{\delta}})$ and communication cost $\mathcal{O}(d^2M\log T)$, while experiments with Google and OpenAI embedding models, combined with Llama and GPT-4o, show consistent improvements over baselines and reduced overhead. This work enables scalable, privacy-conscious online LLM response selection with user-aligned personalization across multi-device deployments.
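To get a feel for how the two stated rates scale, the following minimal sketch evaluates the expressions inside the $\mathcal{O}(\cdot)$ terms with all hidden constants dropped. The function names, the use of the natural logarithm, the reading of $d$ as the feature dimension, $M$ as the number of agents, $A$ as the arm pool size, $T$ as the number of rounds, and $\delta$ as a confidence parameter, and the sample values are assumptions made for illustration; the outputs are not actual regret or communication figures from the paper.

```python
import math


def regret_bound_scale(d: int, M: int, T: int, A: int, delta: float) -> float:
    """Expression inside the O(.) of the stated regret bound, constants dropped."""
    return math.sqrt(d * M * T * math.log(A * M * math.log(T) / delta))


def communication_scale(d: int, M: int, T: int) -> float:
    """Expression inside the O(.) of the stated communication cost, constants dropped."""
    return d ** 2 * M * math.log(T)


if __name__ == "__main__":
    # Hypothetical settings chosen only to illustrate the growth rates.
    d, M, A, delta = 10, 4, 40, 0.05
    for T in (10_000, 40_000):
        print(T, round(regret_bound_scale(d, M, T, A, delta)),
              round(communication_scale(d, M, T)))
```

Under these assumed values, quadrupling $T$ roughly doubles the regret expression (square-root growth with a slowly varying logarithmic factor), while the communication expression grows only logarithmically in $T$.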

Abstract

Prompt-based offline methods are commonly used to optimize large language model (LLM) responses, but evaluating these responses is computationally intensive and often fails to accommodate diverse response styles. This study introduces a novel online evaluation framework that employs a multi-agent conversational bandit model to select optimal responses while dynamically aligning with user preferences. To tackle challenges such as high-dimensional features, large response sets, adaptive conversational needs, and multi-device access, we propose MACO (Multi-Agent Conversational Online Learning), which comprises two key components: (1) \texttt{MACO-A}: Executed by local agents, it employs an online elimination mechanism to filter out low-quality responses. (2) \texttt{MACO-S}: Executed by the cloud server, it adaptively adjusts selection strategies based on aggregated preference data. An adaptive preference mechanism triggers asynchronous conversations to enhance alignment efficiency. Theoretical analysis demonstrates that MACO achieves near-optimal regret bounds, matching state-of-the-art performance in various degenerate cases. Extensive experiments using Google and OpenAI text embedding models on real-world datasets with different response styles, combined with Llama and GPT-4o, show that MACO consistently outperforms baseline methods by at least 8.29\% across varying response set sizes and numbers of agents.
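The idea of filtering out low-quality responses by online elimination can be illustrated with a generic elimination step from the linear bandit literature. This is a minimal sketch under stated assumptions, not the MACO-A procedure itself: the linear reward model, the function names `dot` and `eliminate`, the fixed confidence `width`, and the toy vectors are all hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def eliminate(arms: List[Sequence[float]],
              theta_hat: Sequence[float],
              width: float) -> List[int]:
    """Return indices of arms that survive one elimination step.

    An arm (a response feature vector) is kept when its estimated reward
    is within 2 * width of the best estimated reward. Both the linear
    reward model and this rule are assumptions for illustration.
    """
    scores = [dot(x, theta_hat) for x in arms]
    best = max(scores)
    return [i for i, s in enumerate(scores) if best - s <= 2 * width]


if __name__ == "__main__":
    # Hypothetical response embeddings and preference estimate.
    arms = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    theta_hat = [1.0, 0.2]
    print(eliminate(arms, theta_hat, width=0.25))
```

In this toy run the estimated rewards are 1.0, 0.2, and 0.6, so with a threshold of 0.5 the second arm is dropped and the other two stay in the candidate pool.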

Paper Structure (23 sections, 6 theorems, 5 equations, 6 figures, 6 tables, 2 algorithms)

Introduction
System Model
Online LLM Response Identification
Multi-Agent User-Personalized Bandits
Conversational Contextual Mechanism
Distributed Communication Model
Algorithm Design
MACO Algorithm on Local Agent
MACO Algorithm on Cloud Server
Advantages over Phase Elimination Bandit
Performance Analysis
Performance Evaluation
Experimental Settings
Evaluation Results
Related Work
...and 8 more sections

Key Result

Theorem 1

We have the following upper and lower regret bounds:

Figures (6)

Figure 1: Evaluation aligned with online user feedback.
Figure 2: Multi-agent conversational bandit framework for online selection of LLM responses: Local agents handle response selection (arms), while a central server manages conversation flow through key-term selection. The server aggregates interaction data across multiple agents to accelerate user preference learning.
Figure 3: Regret on embedding models from Google and OpenAI across different arm pool sizes $A$.
Figure 4: A sample LLM response conversation for user-aligned evaluation.
Figure 5: Regret on embedding models from Google and OpenAI across different agent counts $M$.
...and 1 more figure

Theorems & Definitions (16)

Remark 1
Theorem 1: Regret Bounds
Remark 2
Theorem 2: Communication Cost
Remark 3
Theorem 3: Bound on Conversation Frequency
Remark 4
Proof
Lemma 1: Stability of the Information Matrix
Proof
...and 6 more
