Mars-PO: Multi-Agent Reasoning System Preference Optimization

Xiaoxuan Lou, Chaojie Wang, Bo An

TL;DR

Mars-PO introduces a multi-agent extension of Direct Preference Optimization (DPO) to boost mathematical reasoning in instruction-tuned LLMs. By generating diverse responses across agents, selecting a high-quality hybrid positive set, and pairing it with agent-specific negatives, Mars-PO trains each agent to align with robust reasoning patterns while correcting its own weaknesses. Empirical results on the GSM8K and MATH benchmarks show substantial gains (e.g., Llama3.1-8B-Instruct improves from 50.38% to 57.82% on MATH), outperforming the vanilla DPO, DPO+NLL, and SFT baselines. The work demonstrates that coordinated multi-agent reasoning and carefully constructed preference data can meaningfully enhance multi-step mathematical problem solving in LLMs, with practical implications for reliable mathematical reasoning systems.

Abstract

Mathematical reasoning is a fundamental capability for large language models (LLMs), yet achieving high performance in this domain remains a significant challenge. The auto-regressive generation process often makes LLMs susceptible to errors, hallucinations, and inconsistencies, particularly during multi-step reasoning. In this paper, we propose Mars-PO, a novel framework to improve the mathematical reasoning capabilities of LLMs through a multi-agent system. It combines high-quality outputs from multiple agents into a hybrid positive sample set and pairs them with agent-specific negative samples to construct robust preference pairs for training. By aligning agents with shared positive samples while addressing individual weaknesses, Mars-PO achieves substantial performance improvements on mathematical reasoning benchmarks. For example, it increases the accuracy on the MATH benchmark of the state-of-the-art instruction-tuned LLM, Llama3.1-8B-Instruct, from 50.38% to 57.82%. Experimental results further demonstrate that our method consistently outperforms other baselines, such as supervised fine-tuning, vanilla DPO, and its enhanced versions, highlighting the effectiveness of our approach.
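The pairing idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Response` record, the `score` callable (a stand-in for a reward model), the `top_k` cutoff, and the choice to pair every kept positive with every negative are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    agent: str     # which agent produced the response
    prompt: str
    text: str
    correct: bool  # True if the final answer matches the reference answer


def build_preference_pairs(responses, score, top_k=2):
    """Return {agent: [(prompt, chosen, rejected), ...]}.

    `score` is a hypothetical stand-in for a reward model: it maps a
    Response to a float, higher meaning better. `top_k` is an assumed
    cap on how many shared positives are kept per prompt.
    """
    prompts = sorted({r.prompt for r in responses})
    agents = sorted({r.agent for r in responses})
    pairs = {agent: [] for agent in agents}

    for prompt in prompts:
        group = [r for r in responses if r.prompt == prompt]
        # Hybrid positive set: correct responses pooled across all agents,
        # then reduced to the highest-scoring ones.
        positives = sorted(
            (r for r in group if r.correct), key=score, reverse=True
        )[:top_k]
        for agent in agents:
            # Agent-specific negatives: only this agent's own incorrect responses.
            negatives = [r for r in group if r.agent == agent and not r.correct]
            # Assumption: pair every kept positive with every negative.
            for pos in positives:
                for neg in negatives:
                    pairs[agent].append((prompt, pos.text, neg.text))
    return pairs
```

In this sketch an agent's chosen response may come from a different agent, while its rejected response is always one of its own mistakes, which mirrors the "shared positives, individual weaknesses" description.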

Paper Structure

This paper contains 18 sections, 4 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Mars-PO Framework. Our preference optimization method consists of three steps: (i) Response Samples Generation: training prompts are fed into the multi-agent system to generate candidate responses, which are then classified as positive or negative for each agent based on answer correctness. (ii) Positive Pairs Construction: positive samples from all agents are evaluated by a reward model to distill a high-quality positive sample set (PS) for the entire system, while negative samples (NS) proceed directly to the next step. (iii) Hybrid Preference Optimization: preference pairs are selected to perform Mars-PO for each agent, supplemented by NLL loss and optional iterative training to improve model robustness and performance.
  • Figure 2: Accuracy of iterative Mars-PO training on GSM8K and MATH.
  • Figure 3: Accuracy comparison between vanilla DPO and Mars-PO. Solid lines show results for Mars-PO; dashed lines show results for vanilla DPO.
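
The Figure 1 caption mentions a preference objective supplemented by an NLL loss. The sketch below shows one common way to combine the standard DPO loss with an NLL term on the chosen response for a single preference pair. The length normalization of the NLL term, the weight `alpha`, and the default values of `beta` and `alpha` are assumptions for illustration, not details taken from the paper.

```python
import math


def dpo_nll_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
                 chosen_len, beta=0.1, alpha=1.0):
    """DPO loss plus a length-normalized NLL term on the chosen response.

    All log-probabilities are sequence-level sums of token log-probs under
    the policy (logp_*) and the frozen reference model (ref_logp_*).
    beta and alpha are illustrative values, not the paper's settings.
    """
    # Implicit reward margin of chosen over rejected, relative to the reference.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) computed stably as softplus(-margin).
    dpo = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    # Assumed form: negative log-likelihood of the chosen response per token.
    nll = -logp_chosen / chosen_len
    return dpo + alpha * nll
```

When the policy equals the reference model the margin is zero and the DPO term is log 2; raising the chosen response's likelihood relative to the rejected one lowers both terms.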