
Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation

Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Ru Peng, Zenan Huang, Haokai Xu, Yixin Chen, Jian Wu, Junbo Zhao, Zuozhu Liu

Abstract

Large language models (LLMs) achieve strong downstream performance largely due to abundant supervised fine-tuning (SFT) data. However, high-quality SFT data in knowledge-intensive domains such as humanities, social sciences, medicine, law, and finance is scarce because expert curation is expensive, privacy constraints are strict, and label consistency is hard to ensure. Recent work uses synthetic data, typically by prompting a generator over domain documents and filtering outputs with handcrafted rubrics. Yet rubric design is expert-dependent, transfers poorly across domains, and is often optimized through a brittle heuristic loop of writing rubrics, synthesizing data, training, inspecting results, and manually guessing revisions. This process lacks reliable quantitative feedback about how a rubric affects downstream performance. We propose evaluating synthetic data by its training utility on the target model and using this signal to guide data generation. Inspired by influence estimation, we adopt an optimizer-aware estimator that uses gradient information to quantify each synthetic sample's contribution to a target model's objective on specific tasks. Our analysis shows that even when synthetic and real samples are close in embedding space, their influence on learning can differ substantially. Based on this insight, we propose an optimization-based framework that adapts rubrics using target-model feedback. We provide lightweight guiding text and use a rubric-specialized model to generate task-conditioned rubrics. The influence score is then used as the reward to optimize the rubric generator with reinforcement learning. Experiments across domains, target models, and data generators show consistent improvements and strong generalization without task-specific tuning.
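To make the feedback signal concrete, the sketch below shows one way a first-order, optimizer-aware influence score can be computed as the inner product between a synthetic sample's training gradient and a validation gradient, in the spirit of TracIn/LESS-style estimators. This is a minimal illustration under stated assumptions, not the paper's released implementation: the `loss_fn` signature and batch format, the Adam-state preconditioning, and the cosine normalization are all assumptions.

```python
import torch

def flat_grad(loss, params):
    """Flatten the gradient of `loss` w.r.t. `params` into a single vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def influence_score(model, synth_sample, val_batch, loss_fn, optimizer=None, eps=1e-8):
    """First-order influence of one synthetic sample: the inner product of its
    training gradient with the validation gradient. A positive score means a
    step on this sample is expected to reduce the validation loss."""
    params = [p for p in model.parameters() if p.requires_grad]

    g_train = flat_grad(loss_fn(model, synth_sample), params)
    g_val = flat_grad(loss_fn(model, val_batch), params)

    if optimizer is not None:
        # Optimizer-aware variant (assumption): precondition the training
        # gradient by Adam's root second-moment estimate so the score tracks
        # the update the optimizer would actually apply. Requires an
        # Adam-family optimizer that has already taken at least one step.
        v = torch.cat([optimizer.state[p]["exp_avg_sq"].reshape(-1) for p in params])
        g_train = g_train / (v.sqrt() + eps)

    # Cosine normalization keeps scores comparable across samples whose
    # gradient norms differ by orders of magnitude.
    return torch.dot(g_train, g_val) / (g_train.norm() * g_val.norm() + eps)
```

In the framework described above, each such per-sample score would serve as the scalar reward attached to the QA pair generated under a given rubric, which is what lets rubric quality be optimized against target-model feedback rather than hand inspection.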

Paper Structure

This paper contains 61 sections, 9 equations, 20 figures, 9 tables, 1 algorithm.

Figures (20)

  • Figure 1: Visualization of influence score as an effective feature for assessing sample utility. (a) Relationship between validation influence scores and training performance. (b) Distribution of three validation sets in the embedding space of Qwen3-8B-Base. (c) Distribution of three validation sets in the loss–gradient embedding space, showing closer alignment and shared optimization directions.
  • Figure 2: The proposed RL framework for synthetic data generation with optimizer-aware rewards. (a) RL workflow. Starting from a seed document, a prompter instantiates $G$ seed-specific rubrics, and a generator produces synthetic QA pairs. (b) Reward computation. We employ the influence score for each synthetic QA pair as the reward, followed by a policy update (a sketch of such an update appears after this figure list).
  • Figure 3: Domain distribution of seed documents used to synthesize question-answering pairs: Humanities and Social Sciences and Medical and Health.
  • Figure 4: The performance of our framework across different target model scales and families.
  • Figure 5: The performance of our framework across different generators.
  • ...and 15 more figures
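Figure 2(b) describes using per-sample influence scores as rewards for a policy update of the rubric generator. The exact RL algorithm is not spelled out in this listing, so the sketch below uses a generic REINFORCE step with a mean baseline as an illustrative assumption; `rubric_generator`, `log_probs`, and the batching of the $G$ rubrics per seed are hypothetical names, not the paper's API.

```python
import torch

def reinforce_step(optimizer, log_probs, rewards):
    """One REINFORCE-style policy update for a rubric generator.

    log_probs: summed token log-probabilities of each generated rubric,
               shape (G,), with gradients attached to the generator.
    rewards:   influence scores of the QA pairs those rubrics produced,
               shape (G,), treated as constants.
    """
    # A mean baseline is the simplest variance-reduction choice; rubrics
    # whose QA pairs score above the batch average are reinforced.
    advantages = rewards - rewards.mean()
    loss = -(advantages.detach() * log_probs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A group-relative baseline computed over the $G$ rubrics instantiated from the same seed document would be a natural alternative to the plain mean baseline shown here.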