AQA: Adaptive Question Answering in a Society of LLMs via Contextual Multi-Armed Bandit

Mohanna Hoveyda, Arjen P. de Vries, Maarten de Rijke, Harrie Oosterhuis, Faegheh Hasibi

TL;DR

This work frames adaptive QA as a contextual multi-armed bandit problem: the context is given by the characteristics of the incoming question, and the action space consists of candidate communication graph configurations among the LLM agents. A linear upper confidence bound (LinUCB) model is then trained to learn a mapping from question types to their optimal multi-LLM communication graph representation.

Abstract

In question answering (QA), different questions can be effectively addressed with different answering strategies. Some require a simple lookup, while others need complex, multi-step reasoning to be answered adequately. This observation motivates the development of a dynamic method that adaptively selects the most suitable QA strategy for each question, enabling more efficient and effective systems capable of addressing a broader range of question types. To this end, we build on recent advances in the orchestration of multiple large language models (LLMs) and formulate adaptive QA as a dynamic orchestration challenge. We define this as a contextual multi-armed bandit problem, where the context is defined by the characteristics of the incoming question and the action space consists of potential communication graph configurations among the LLM agents. We then train a linear upper confidence bound model to learn an optimal mapping between different question types and their corresponding optimal multi-LLM communication graph representation. Our experiments show that the proposed solution is viable for adaptive orchestration of a QA system with multiple modules, as it retains the superior performance of more complex strategies while avoiding their costs when simpler strategies suffice.
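
To make the bandit formulation concrete, the following is a minimal sketch of disjoint LinUCB in plain Python, not the authors' implementation. Several things here are assumptions for illustration only: the one-hot question-type contexts, the reward values (made-up numbers that fold answer quality and cost into a single score), and the reduction of the action space to three fixed strategies whose names are borrowed from the node labels in Figure 4 (NoR, OneR, IRCoT). The paper's actual action space is a set of communication graph configurations.

```python
import math


class LinUCBArm:
    """One arm of disjoint LinUCB: a ridge-regression reward model plus a confidence bonus."""

    def __init__(self, dim):
        # Inverse of A = I + sum(x x^T), kept up to date incrementally.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        # b = sum(reward * x)
        self.b = [0.0] * dim

    def _a_inv_times(self, x):
        return [sum(a * xj for a, xj in zip(row, x)) for row in self.a_inv]

    def ucb(self, x, alpha):
        a_inv_x = self._a_inv_times(x)
        # Estimated reward theta^T x with theta = A^{-1} b (A^{-1} is symmetric).
        mean = sum(bi * v for bi, v in zip(self.b, a_inv_x))
        # Exploration bonus alpha * sqrt(x^T A^{-1} x).
        bonus = alpha * math.sqrt(max(sum(xi * v for xi, v in zip(x, a_inv_x)), 0.0))
        return mean + bonus

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of A^{-1} after adding x x^T to A.
        a_inv_x = self._a_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
            self.b[i] += reward * x[i]


def select(arms, x, alpha):
    """Return the name of the arm with the highest upper confidence bound."""
    return max(arms, key=lambda name: arms[name].ucb(x, alpha))


# Hypothetical setup: one-hot question-type contexts and made-up rewards.
CONTEXTS = {"simple": [1.0, 0.0], "multi_hop": [0.0, 1.0]}
REWARDS = {
    "simple": {"NoR": 0.8, "OneR": 0.6, "IRCoT": 0.3},
    "multi_hop": {"NoR": 0.1, "OneR": 0.4, "IRCoT": 0.7},
}


def train(rounds=400, alpha=0.5):
    arms = {name: LinUCBArm(dim=2) for name in ("NoR", "OneR", "IRCoT")}
    kinds = list(CONTEXTS)
    for t in range(rounds):
        kind = kinds[t % len(kinds)]  # alternate question types deterministically
        x = CONTEXTS[kind]
        chosen = select(arms, x, alpha)
        arms[chosen].update(x, REWARDS[kind][chosen])
    return arms


arms = train()
# Greedy policy after training (alpha=0 drops the exploration bonus).
learned = {kind: select(arms, x, alpha=0.0) for kind, x in CONTEXTS.items()}
print(learned)
```

Under these made-up rewards, the learned greedy policy routes simple questions to NoR and multi-hop questions to IRCoT, which mirrors the abstract's point that a cheaper strategy is chosen whenever it suffices. The parameter `alpha` controls how strongly uncertain arms are favored during training; its value here is arbitrary.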

Paper Structure (25 sections, 4 equations, 4 figures, 2 tables, 1 algorithm)

Figures (4)

  • Figure 1: LinUCB expected rewards for the individual-agent action space; dashed lines depict the real rewards per action.
  • Figure 2: LinUCB expected rewards for the collaborative action space; the dashed line depicts the real reward for the optimal action.
  • Figure 3: LinUCB action selection distribution for the collaborative action space.
  • Figure 4: Edge probability distribution among the NoR, OneR, IRCoT, and FinalDecision nodes before (left) and after (right) optimization using GPTSwarm (arXiv:2402.16823).