MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models

Justin Chih-Yao Chen, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal

TL;DR

MAGDi tackles the high cost of multi-agent reasoning by distilling interactions among multiple LLMs into a single, smaller model. It represents teacher discussions as Multi-Agent Interaction Graphs (MAGs): directed acyclic graphs whose nodes are reasoning outputs and whose edges encode the refinement process across rounds. It then augments a base LM with a Graph Neural Network to learn structure-aware representations. The method combines three objectives: learning from correct reasoning ($\mathcal{L}^+$), learning from incorrect reasoning via a contrastive loss ($\mathcal{L}^-$), and learning from the interaction structure ($\mathcal{L}_I$). Together these form the final objective $\mathcal{L}_{MAG}$ that guides distillation. Experiments across seven commonsense and math benchmarks show that MAGDi outperforms single-teacher baselines and multi-teacher baselines that do not model structure, while delivering up to 9x test-time efficiency gains over multi-agent systems. MAGDi also generalizes to out-of-domain (OOD) tasks and improves as the base student model gets stronger.

Abstract

Multi-agent interactions between Large Language Model (LLM) agents have shown major improvements on diverse reasoning tasks. However, these involve long generations from multiple models across several rounds, making them expensive. Moreover, these multi-agent approaches fail to provide a final, single model for efficient inference. To address this, we introduce MAGDi, a new method for structured distillation of the reasoning interactions between multiple LLMs into smaller LMs. MAGDi teaches smaller models by representing multi-agent interactions as graphs, augmenting a base student model with a graph encoder, and distilling knowledge using three objective functions: next-token prediction, a contrastive loss between correct and incorrect reasoning, and a graph-based objective to model the interaction structure. Experiments on seven widely used commonsense and math reasoning benchmarks show that MAGDi improves the reasoning capabilities of smaller models, outperforming several methods that distill from a single teacher and multiple teachers. Moreover, MAGDi also demonstrates an order of magnitude higher efficiency over its teachers. We conduct extensive analyses to show that MAGDi (1) enhances the generalizability to out-of-domain tasks, (2) scales positively with the size and strength of the base student model, and (3) obtains larger improvements (via our multi-teacher training) when applying self-consistency -- an inference technique that relies on model diversity.
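The graph representation and three-part objective described above can be illustrated with a small sketch. This is not the authors' implementation: the `Node` class, the `build_mag` and `combined_loss` helpers, the rule that every node in one round points to every node in the next, and the weights `alpha`, `beta`, `gamma` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Node:
    """One reasoning chain in a toy multi-agent interaction graph."""
    agent: str     # teacher that produced the chain
    round: int     # discussion round, starting at 0
    correct: bool  # whether the chain's final answer matches the gold answer


def build_mag(agents, num_rounds, is_correct):
    """Return (nodes, edges) for a toy interaction graph.

    Assumption: every node in round r points to every node in round r + 1,
    so edges only go forward in time and the graph is acyclic.
    """
    nodes = [Node(a, r, is_correct(a, r)) for r in range(num_rounds) for a in agents]
    edges = [(u, v) for u, v in product(nodes, nodes) if v.round == u.round + 1]
    return nodes, edges


def combined_loss(l_pos, l_neg, l_int, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the three objectives (the weights are hypothetical)."""
    return alpha * l_pos + beta * l_neg + gamma * l_int


# Hypothetical discussion: only teacher "A" is right in round 0,
# and every teacher is right from round 1 onward.
agents = ["A", "B", "C"]
nodes, edges = build_mag(agents, num_rounds=3, is_correct=lambda a, r: r > 0 or a == "A")

positives = [n for n in nodes if n.correct]      # chains a next-token loss would use
negatives = [n for n in nodes if not n.correct]  # chains a contrastive loss would push away from

print(len(nodes), len(edges), len(positives), len(negatives))
```

In this sketch the correct and incorrect chains are separated by node label, while the edge list is what a graph encoder would consume to model the interaction structure.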

Paper Structure (20 sections, 5 equations, 8 figures, 13 tables)

Figures (8)

  • Figure 1: Overview of our distillation method. Given a reasoning problem, multiple teacher-LLMs engage in a multi-round discussion, leading to the generation of a multi-agent interaction graph (MAG). Our structured distillation method, MAGDi, then distills reasoning knowledge from these graphs into a base student model.
  • Figure 2: Left (a): Illustration of a Multi-Agent Interaction Graph (MAG) constructed with GPT4, Bard, and Claude2 collaboratively solving a math reasoning problem over three discussion rounds. Right (b-e): The four levels that characterize our structured distillation method (MAGDi); each level progressively distills knowledge from the highlighted components of a MAG.
  • Figure 3: Training Data Construction: Given a reasoning problem, multiple teachers go through a multi-round discussion process, generating multi-agent interaction graphs (MAGs). MAGDi: Our structured distillation method augments a base student model with a Graph Neural Network (specifically, a GCN) to learn structure-aware representations of reasoning chains. The resultant model is then fine-tuned with a combination of three objectives involving positive chains, negative chains, and the underlying interactions.
  • Figure 4: Trade-off between performance and efficiency. MAGDi exceeds the Pareto frontier of prior work, surpassing single-teacher models in performance and surpassing ReConcile in efficiency, defined as $1/\mathrm{avg}(\text{tokens})$.
  • Figure 5: Scaling results of MAGDi with different base student models. As the average (zero-shot) performance of the base model improves (Mistral-7B $>$ LLaMA-2-13B $>$ LLaMA-2-7B), MAGDi shows a corresponding increase.
  • ...and 3 more figures
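
The efficiency measure in the Figure 4 caption, $1/avg(tokens)$, can be computed as in the sketch below. The token counts are made-up placeholders rather than results from the paper, and the `efficiency` helper is a hypothetical name.

```python
def efficiency(token_counts):
    """Efficiency as the reciprocal of the average number of generated tokens."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    return 1.0 / (sum(token_counts) / len(token_counts))


# Made-up token counts per test question (not taken from the paper).
student_tokens = [100, 120, 80]       # a single distilled model
multi_agent_tokens = [400, 500, 300]  # several models over several rounds

# Ratio of efficiencies: how many times fewer tokens the student generates on average.
relative_gain = efficiency(student_tokens) / efficiency(multi_agent_tokens)
print(f"relative efficiency gain: {relative_gain:.1f}x")
```

Because the measure is a reciprocal of an average, the ratio of two efficiencies equals the inverse ratio of the average token counts.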