MAG-V: A Multi-Agent Framework for Synthetic Data Generation and Verification

Saptarshi Sengupta, Harsh Vashistha, Kristal Curtis, Akshay Mallipeddi, Abhinav Mathur, Joseph Ross, Liang Gou

TL;DR

MAG-V introduces a deterministic, multi-agent workflow to generate synthetic customer queries and to verify agent tool-usage trajectories without relying on LLM-based judges. By combining a data-generation pipeline with distant-supervision-based trajectory verification and multiple discriminative ML models, it demonstrates competitive performance against GPT-based baselines and highlights cost and determinism advantages. The work advances scalable, aligned agent systems by providing synthetic data and a robust verification mechanism, with future plans to scale data, ground trajectories to questions, and refine labeling. Overall, MAG-V contributes a cohesive framework for reliable, data-efficient agent development in privacy-sensitive domains.

Abstract

Extending the capabilities of Large Language Models (LLMs) with functions or tools for environment interaction has led to the emergence of the agent paradigm. In industry, training an LLM is not always feasible because of the scarcity of domain data, legal holds on proprietary customer data, rapidly changing business requirements, and the need to prototype new assistants. Agents provide an elegant solution to these constraints by relying on the zero-shot reasoning abilities of the underlying LLM and using tools to explore and reason over customer data and respond to user requests. However, two concerns remain: (I) acquiring large-scale customer queries for agent testing is time-consuming, and (II) heavy reliance on the tool call sequence (or trajectory) followed by the agent to respond to user queries may lead to unexpected or incorrect behavior. To address these concerns, we propose MAG-V, a multi-agent framework that first generates a dataset of questions that mimic customer queries and then reverse-engineers alternate questions from the responses for trajectory verification. Initial results indicate that our synthetic data can improve agent performance on actual customer queries. Furthermore, our trajectory verification methodology, inspired by distant supervision and using traditional machine learning (ML) models, outperforms a GPT-4o judge baseline by 11% accuracy and matches the performance of a GPT-4 judge on our constructed dataset. Overall, our approach is a step towards unifying diverse task agents into a cohesive framework for achieving an aligned objective.
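The verification idea in the abstract (compare the original question with alternate questions reverse-engineered from the agent's response, then decide deterministically whether the trajectory looks right) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the abstract does not specify the features or models, so token-set Jaccard overlap stands in for whatever similarity features the authors use, a fixed threshold stands in for a trained discriminative classifier, and the names jaccard, similarity_features, and verify_trajectory are hypothetical.

```python
from typing import List


def tokens(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity in [0, 1] (illustrative stand-in feature)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def similarity_features(question: str, alternates: List[str]) -> List[float]:
    """Return [min, mean, max] similarity between a question and its alternates."""
    sims = [jaccard(question, alt) for alt in alternates]
    if not sims:
        return [0.0, 0.0, 0.0]
    return [min(sims), sum(sims) / len(sims), max(sims)]


def verify_trajectory(question: str, alternates: List[str],
                      threshold: float = 0.5) -> bool:
    """Accept the trajectory when the mean similarity reaches the threshold.

    Assumption: a fixed threshold replaces the trained ML model for brevity.
    """
    mean_similarity = similarity_features(question, alternates)[1]
    return mean_similarity >= threshold


if __name__ == "__main__":
    base = "What is the balance on my savings account"
    close = ["what is my savings account balance",
             "show the balance on my savings account"]
    unrelated = ["how do I reset my password"]
    print(verify_trajectory(base, close))
    print(verify_trajectory(base, unrelated))
```

The intuition is that if questions regenerated from the response drift far from what the user asked, the trajectory that produced the response is suspect. Because the decision depends only on computed features and a fixed rule, repeated runs on the same inputs give the same verdict, which is the determinism advantage claimed over an LLM judge.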

Paper Structure

This paper contains 12 sections, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Overview of MAG-V (AQ = Alternate Question).
  • Figure 2: Accuracy (left) and F1 (right) scores for all ML models using all features vs. the GPT baselines.