Enhancing Reasoning with Collaboration and Memory

Julie Michelman, Nasrin Baratalipour, Matthew Abueg

TL;DR

This work investigates how collaboration among multiple LLM agents, diverse reasoning styles, and memory banks can improve reasoning performance. By introducing varied-context exemplars and a summarizer agent, and by comparing frozen and learned memory with different retrieval strategies, the study reveals that random exemplar retrieval and distributed varied-context perspectives often outperform more principled similarity-based retrieval and homogeneous setups. Analogical prompting demonstrates robustness to memory design, while summarizers tend to aid weaker models more than stronger ones. The findings offer practical guidance for building continuous, memory-augmented, multi-agent reasoning systems and highlight the nuanced interactions between memory, prompting, and collaboration in large language models.

Abstract

We envision a continuous collaborative learning system in which groups of LLM agents work together to solve reasoning problems, drawing on memory they collectively build to improve performance as they gain experience. This work establishes the foundations for such a system by studying the interoperability of chain-of-thought reasoning styles, multi-agent collaboration, and memory banks. Extending beyond the identical agents of self-consistency, we introduce varied-context agents with diverse exemplars and a summarizer agent in place of voting. We generate frozen and continuously learned memory banks of exemplars and pair them with fixed, random, and similarity-based retrieval mechanisms. Our systematic study reveals where each method contributes to the reasoning performance of two LLMs on three grounded reasoning tasks. It shows that random exemplar selection can often beat more principled approaches and that, in some tasks, including any exemplars at all serves only to distract both weak and strong models.
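To make the three retrieval mechanisms named in the abstract concrete, here is a minimal sketch. It is not the authors' implementation: the function names, the dictionary layout of an exemplar, and the use of word-set (Jaccard) overlap as the similarity measure are all assumptions chosen for illustration. The source does not specify how similarity is computed.

```python
import random


def token_overlap(a, b):
    """Jaccard overlap between the lowercase word sets of two strings.

    Assumption: a stand-in similarity measure, not the one used in the paper.
    """
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def retrieve(memory, question, k, strategy, rng=None):
    """Pick up to k exemplars from a memory bank (hypothetical helper).

    memory   : list of dicts, each with at least a "question" key
    strategy : "fixed"   -> always the same first k exemplars
               "random"  -> k exemplars sampled without replacement
               "similar" -> k exemplars whose questions best match `question`
    """
    k = min(k, len(memory))
    if strategy == "fixed":
        return memory[:k]
    if strategy == "random":
        rng = rng or random.Random()
        return rng.sample(memory, k)
    if strategy == "similar":
        ranked = sorted(
            memory,
            key=lambda ex: token_overlap(ex["question"], question),
            reverse=True,
        )
        return ranked[:k]
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    # Toy memory bank, invented purely for this example.
    toy_memory = [
        {"question": "red ball on the table", "answer": "a"},
        {"question": "alice swaps gifts with bob", "answer": "b"},
        {"question": "all birds can fly", "answer": "c"},
    ]
    for strategy in ("fixed", "random", "similar"):
        picked = retrieve(toy_memory, "who swaps gifts with alice", 2, strategy)
        print(strategy, [ex["answer"] for ex in picked])
```

Fixed retrieval ignores the question entirely, random retrieval ignores it but varies between calls, and only similarity-based retrieval conditions on the question, which is the contrast the study evaluates.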

Paper Structure

This paper contains 28 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: The multi-agent memory reasoning system. Each agent uses a given reasoning style: direct, zero- or few-shot chain-of-thought, or analogical prompting. The system runs a single greedy agent, several temperature-sampled agents, or several varied-context agents (i.e., agents with different exemplars). Answers are combined by voting or by a summarizer agent. The memory bank contains exemplars with correct answers from the training set. It can be a reusable frozen set generated by a single greedy ZCoT agent. Alternatively, it can be learned continuously, with the multi-agent, in-context learning, and retrieval setup drawing on exemplars added to memory earlier in the training pass. Retrieved exemplars are a small fixed set, are randomly sampled, or are those whose questions are most similar to the current example. Finally, while few-shot CoT requires exemplars, memory is an optional augmentation for analogical prompting.
  • Figure 2: Diagram of agents' NCoT exemplars drawn from a memory bank for two validation set questions. It illustrates three levels of multi-agent collaboration (single, SC, and varied-context) and two types of memory retrieval (fixed and random). A1 is agent 1, ex A is exemplar A, and q1 is question 1. Fixed retrieval reuses the same exemplars every time, while random retrieval samples new exemplars for each question. SC agents share exemplars within a question and use a higher temperature (spiky red border) to vary their responses. Varied-context agents sample exemplars independently, so they can use temperature-zero greedy decoding (smooth blue border).
  • Figure 3: Main Experiments. Accuracy over the validation set for the two models (Pro and Ultra, in columns) and three tasks (FOLIO, RACO, TSO, in rows). The three concrete reasoning tasks under test vary in difficulty: FOLIO, a challenging expert-written, open-domain first-order logic task; RACO, the Reasoning about Colored Objects BIG-Bench Hard Q&A dataset; and TSO, the Tracking Shuffled Objects BIG-Bench Hard text completion dataset. Ultra is a significantly larger model than Pro with generally better reasoning performance. ZCoT and NCoT, generally considered standard practice for strong baselines, are beaten in many but not all settings by augmented methods.
  • Figure 4: Analogical Prompting vs Chain-of-Thought. Accuracy over the validation set for the two models and three tasks.
  • Figure 5: More Shots vs Varied-Context Agents. Accuracy over the validation set for the two models and three tasks.
  • ...and 1 more figures
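
The collaboration levels in the Figure 2 caption and the memory rule in the Figure 1 caption can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's code: `plan_agents`, `majority_vote`, and `update_memory` are hypothetical names, the value of `SC_TEMPERATURE` is invented (the caption only says SC agents use a higher temperature), random retrieval is used throughout, and the summarizer agent is omitted because it would require an actual LLM call.

```python
import random
from collections import Counter

# Assumed value; the source only states that SC agents use a higher temperature.
SC_TEMPERATURE = 0.7


def plan_agents(memory, n_agents, k, mode, rng):
    """Decide which exemplars and decoding temperature each agent gets.

    mode "single" : one greedy agent
    mode "sc"     : n_agents share one exemplar set and sample at a
                    higher temperature to diversify their answers
    mode "varied" : n_agents each draw their own exemplars and decode
                    greedily (temperature 0)
    """
    k = min(k, len(memory))
    if mode == "single":
        return [{"exemplars": rng.sample(memory, k), "temperature": 0.0}]
    if mode == "sc":
        shared = rng.sample(memory, k)
        return [
            {"exemplars": shared, "temperature": SC_TEMPERATURE}
            for _ in range(n_agents)
        ]
    if mode == "varied":
        return [
            {"exemplars": rng.sample(memory, k), "temperature": 0.0}
            for _ in range(n_agents)
        ]
    raise ValueError(f"unknown mode: {mode}")


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def update_memory(memory, question, rationale, predicted, gold):
    """Continuously learned memory: keep only exemplars with correct answers."""
    if predicted == gold:
        memory.append(
            {"question": question, "rationale": rationale, "answer": predicted}
        )
        return True
    return False


if __name__ == "__main__":
    rng = random.Random(0)
    toy_memory = [f"ex {c}" for c in "ABCDEF"]  # toy exemplar labels
    for mode in ("single", "sc", "varied"):
        for i, agent in enumerate(plan_agents(toy_memory, 3, 2, mode, rng), 1):
            print(mode, f"A{i}", agent["exemplars"], agent["temperature"])
    print(majority_vote(["yes", "no", "yes"]))
```

In this sketch the diversity among SC agents comes only from sampling temperature, whereas varied-context agents get their diversity from the exemplars themselves, which is why they can decode greedily.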