DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback

Zaid Khan, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal

TL;DR

DataEnvGym is a modular testbed for autonomous data generation agents (teachers) that iteratively improve student models by generating targeted training data conditioned on the student's observed weaknesses. The framework separates environment modules (trainer/evaluator, skill discovery, skill organization) from agent modules (data generation policy and engine), enabling both open-ended and structured (skill-list, skill-tree) data generation across math, code, VQA, and tool-use tasks. Empirical results show consistent student improvements on GQA, MATH, LiveCodeBench, NaturalBench, and MnMs, with state-conditioned policies outperforming state-agnostic baselines. Skill-based environments offer interpretability and curriculum control, and higher-quality skill discovery further increases the gains. Overall, DataEnvGym provides a platform for evaluating and improving data generation agents, engines, and feedback mechanisms, with the aim of automating the data creation loop in model development.

Abstract

The process of creating training data to teach models is currently driven by humans, who manually analyze model weaknesses and plan how to create data that improves a student model. Approaches using LLMs as annotators reduce human effort, but still require humans to interpret feedback from evaluations and control the LLM to produce data the student needs. Automating this labor-intensive process by creating autonomous data generation agents - or teachers - is desirable, but requires environments that can simulate the feedback-driven, iterative, closed loop of data creation. To enable rapid, scalable testing for such agents and their modules, we introduce DataEnvGym, a testbed of teacher environments for data generation agents. DataEnvGym frames data generation as a sequential decision-making task, involving an agent consisting of a data generation policy (which generates a plan for creating training data) and a data generation engine (which transforms the plan into data), inside an environment that provides student feedback. The agent's goal is to improve student performance. Students are iteratively trained and evaluated on generated data, and their feedback (in the form of errors or weak skills) is reported to the agent after each iteration. DataEnvGym includes multiple teacher environment instantiations across 3 levels of structure in the state representation and action space. More structured environments are based on inferred skills and offer more interpretability and curriculum control. We support 4 domains (math, code, VQA, and tool-use) and test multiple students and teachers. Example agents in our teaching environments can iteratively improve students across tasks and settings. Moreover, we show that environments teach different skill levels and test variants of key modules, pointing to future work in improving data generation agents, engines, and feedback mechanisms.
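The closed loop described in the abstract (evaluate the student, plan data from its feedback, generate the data, retrain) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `State`, `policy`, `engine`, `ToyEnvironment`, and `run` are hypothetical and not DataEnvGym's API, the error-proportional allocation is one simple state-conditioned policy, and the "training" step is a toy update rather than real model training.

```python
from dataclasses import dataclass
from typing import Dict, List

# All names below are hypothetical and chosen for illustration only;
# they are not taken from the DataEnvGym codebase.


@dataclass
class State:
    """Student feedback reported to the agent after each iteration."""
    skill_accuracy: Dict[str, float]


def policy(state: State, budget: int) -> Dict[str, int]:
    """Data generation policy: turn feedback into a plan (examples per skill)."""
    # Weight each skill by its error rate so weaker skills receive more data.
    errors = {s: 1.0 - acc for s, acc in state.skill_accuracy.items()}
    total = sum(errors.values()) or 1.0
    return {s: round(budget * e / total) for s, e in errors.items()}


def engine(plan: Dict[str, int]) -> List[str]:
    """Data generation engine: turn the plan into data (placeholder strings)."""
    return [f"{skill}-example-{i}" for skill, n in plan.items() for i in range(n)]


class ToyEnvironment:
    """Stand-in environment: 'training' nudges per-skill accuracy upward."""

    def __init__(self, skill_accuracy: Dict[str, float]) -> None:
        self.skill_accuracy = dict(skill_accuracy)

    def evaluate(self) -> State:
        return State(dict(self.skill_accuracy))

    def train(self, data: List[str]) -> None:
        # Toy assumption: each example closes 1% of the remaining gap.
        for example in data:
            skill = example.split("-example-")[0]
            acc = self.skill_accuracy[skill]
            self.skill_accuracy[skill] = min(1.0, acc + 0.01 * (1.0 - acc))


def run(env: ToyEnvironment, iterations: int, budget: int) -> List[float]:
    """Run the loop and return mean student accuracy after each iteration."""
    history = []
    for _ in range(iterations):
        state = env.evaluate()        # environment reports student feedback
        plan = policy(state, budget)  # policy plans what data to create
        data = engine(plan)           # engine turns the plan into data
        env.train(data)               # student is retrained on the new data
        accs = env.skill_accuracy.values()
        history.append(sum(accs) / len(accs))
    return history
```

Calling `run(ToyEnvironment({"algebra": 0.4, "geometry": 0.8}), iterations=3, budget=20)` returns one mean accuracy per iteration. Separating `policy` from `engine` mirrors the split the abstract describes between planning what data to create and actually producing it.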

Paper Structure (41 sections, 18 figures, 8 tables)

Figures (18)

  • Figure 1: Overview of DataEnvGym, a novel testbed for data generation agents. The environment (left) consists of (a) evaluation and (d) training of the student model. The data generation agent (right) takes a state encoding the current student model's performance and provides training data to improve the student model, by first creating a plan through the (b) data generation policy, then executing the plan via the (c) data generation engine.
  • Figure 2: Illustration of the three example instances of DataEnvGym environments described in \ref{sec:method}. In the (a) Open-Ended environment, the state is represented as a list of per-example accuracies, and the data generation policy directly creates a data generation plan from them. In the (b) Skill-List environment, the state is represented as a categorized list of skills and per-skill student model performance; its data generation plan allows the policy to prioritize weak skills. In the (c) Skill-Tree environment, the state is represented as a forest of skill trees containing skill-subskill relational information, and its data generation policy chooses between two actions for each skill: explore (grow skill tree) and exploit (rebalance skill tree).
  • Figure 3: Example skill tree updates over time for the MATH task's "Algebra" skill in the Skill-Tree environment. Starting from an empty single node, the data generation policy (\ref{sec:data_generation_policy}) iteratively chooses between two actions: explore (grow skill tree) and exploit (rebalance skill tree). The skill organization module (\ref{sec:skill_organization}) then adds or removes subskills accordingly and reallocates the training data for each subskill.
  • Figure 4: Per-skill accuracy improvements of Gemma-2B trained on MATH in the Skill-Tree environment, as a function of (a) question difficulty and (b) skill rarity (inverse of frequency) in the training data. The biggest performance increases occur in the middle range for difficulty and in the lower range for rarity (i.e., on more frequent skills).
  • Figure 5: Training dynamics across three tasks. Each line is split into a solid segment terminating in a vertical line (the maximum performance achieved by the student) and a dashed line showing the effect of continued training beyond this point.
  • ...and 13 more figures
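
The explore and exploit actions named in the captions of Figures 2 and 3 can be sketched as operations on a small tree structure. This is an illustrative sketch only: the `SkillTree` class, the function signatures, and the error-proportional rebalancing rule are assumptions, not the paper's implementation. The captions only state that explore grows the tree and exploit rebalances it.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical names and rules for illustration; not the DataEnvGym API.


@dataclass
class SkillTree:
    skill: str
    # Maps each subskill name to the number of training examples allocated to it.
    allocation: Dict[str, int] = field(default_factory=dict)


def explore(tree: SkillTree, new_subskills: List[str], per_subskill: int) -> None:
    """Grow the tree: add subskills that are not yet present."""
    for name in new_subskills:
        tree.allocation.setdefault(name, per_subskill)


def exploit(tree: SkillTree, subskill_accuracy: Dict[str, float]) -> None:
    """Rebalance the tree: keep the total data budget, favor weak subskills."""
    total = sum(tree.allocation.values())
    # Assumed rule: allocate in proportion to each subskill's error rate.
    errors = {s: 1.0 - subskill_accuracy.get(s, 0.0) for s in tree.allocation}
    norm = sum(errors.values())
    if norm == 0:
        return  # every subskill is already solved; nothing to rebalance
    shares = {s: int(total * e / norm) for s, e in errors.items()}
    # Give any rounding remainder to the weakest subskill.
    weakest = max(errors, key=errors.get)
    shares[weakest] += total - sum(shares.values())
    tree.allocation = shares


# Usage: start from an empty node, grow it, then rebalance on feedback.
tree = SkillTree("Algebra")
explore(tree, ["linear equations", "quadratics"], per_subskill=10)
exploit(tree, {"linear equations": 0.9, "quadratics": 0.5})
```

In this toy run the total budget of 20 examples is preserved, and most of it moves to the weaker "quadratics" subskill.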