NanoFlux: Adversarial Dual-LLM Evaluation and Distillation For Multi-Domain Reasoning

Raviteja Anantha, Soheil Hor, Teodor Nicola Antoniu, Layne C. Price

TL;DR

NanoFlux presents a fully automatic adversarial framework where dual LLMs (Attacker and Defender) generate targeted, multi-hop reasoning questions under the supervision of a tool-augmented Judge. By constraining synthesis to roughly 200 examples per domain, it outperforms traditional full-dataset fine-tuning across GSMHard, GenomeBench, and MultiMedQA while achieving substantial compute savings. The approach leverages embedding-based novelty filtering and domain-specific judge tooling to create high-information training signals and diverse reasoning patterns, yielding domain-agnostic gains and revealing non-monotonic relationships between dataset characteristics and model performance. The work suggests that intelligently synthesized, small training sets can dramatically improve reasoning capabilities with far greater data efficiency than large-scale data collection.

Abstract

We present NanoFlux, a novel adversarial framework for generating targeted training data to improve LLM reasoning, where adversarially-generated datasets containing fewer than 200 examples outperform conventional fine-tuning approaches. The framework employs a competitive dynamic between models alternating as Attacker and Defender, supervised by a tool-augmented Judge, synthesizing multi-step questions with explanatory annotations that target specific reasoning capabilities. Fine-tuning a 4B-parameter model on NanoFlux-generated data yields performance gains across diverse domains compared to full-benchmark fine-tuning: +5.9% on mathematical reasoning (GSMHard), +3.6% on scientific reasoning (GenomeBench), and +16.6% on medical reasoning (MultiMedQA), while reducing computational requirements by 3-14x. Ablation studies reveal a non-monotonic relationship between dataset characteristics and model performance, uncovering domain-specific optimal points for question complexity and reasoning quality. NanoFlux automates training data generation through embedding-based novelty filtering, tool-augmented evaluation, and multi-hop reasoning, suggesting that future model improvements may lie in the intelligent synthesis of small, precisely targeted training datasets.

Paper Structure

This paper contains 34 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The NanoFlux Adversarial Data Generation Framework. The process begins with random sampling from benchmark datasets (left), followed by the attacker model generating questions through concept stitching. Generated questions undergo embedding-based novelty filtering to ensure diversity, then validation by the judge model equipped with code execution and web search capabilities. The defender model attempts to solve validated questions, with the judge evaluating response correctness. Questions that the defender fails to solve (or solves through novel approaches) are retained in the final NanoFlux dataset, which contains 200 examples per domain.
  • Figure 2: Data Scaling Ablation of NanoFlux Across Three Benchmarks. Performance curves showing the relationship between NanoFlux training dataset size (50--200 datapoints) and model accuracy across GSMHard (mathematical reasoning), GenomeBench (genomics), and MultiMedQA (medical QA) benchmarks. Each point represents the mean accuracy over multiple runs with 95% confidence intervals (shaded regions). GSMHard and GenomeBench exhibit monotonic improvement with diminishing returns, achieving 3.0% and 2.2% accuracy gains respectively from 50 to 200 datapoints. MultiMedQA demonstrates non-monotonic scaling with peak performance at 150 datapoints (69.0%), followed by a 2.6% degradation at 200 datapoints. The results indicate that optimal NanoFlux dataset sizes are benchmark-dependent, with medical domains requiring more careful data curation to avoid performance degradation.
  • Figure 3: NanoFlux Sensitivity to Question Complexity and Data Quality Across Benchmarks. Top row: Question complexity ablation showing the effect of seed question count on model accuracy across GSMHard (5--10 seeds), GenomeBench (5--10 seeds), and MultiMedQA (7--12 seeds) benchmarks. GSMHard and GenomeBench exhibit inverse relationships with complexity, achieving peak performance at 6 and 5 seed questions respectively, then declining monotonically (GSMHard: 63.78% $\rightarrow$ 59.45%; GenomeBench: 64.60% $\rightarrow$ 53.87%). MultiMedQA demonstrates higher complexity tolerance, peaking at 9 seed questions (71.10%) before gradual degradation. Bottom row: Data quality ablation across quality levels L1--L5, revealing consistent patterns where L4 represents the optimal quality--performance trade-off. All benchmarks show performance degradation at the highest quality level L5, with MultiMedQA exhibiting the most quality sensitivity (35.71% at L2 $\rightarrow$ 75.02% at L4 $\rightarrow$ 65.97% at L5). The results demonstrate that NanoFlux performance is sensitive to both question complexity and data quality, with domain-specific optimal configurations and consistent evidence that maximum complexity/quality does not guarantee optimal performance.
  • Figure 4: Training and validation loss curves comparing NanoFlux efficiency with full dataset training on MultiMedQA. Left: MedGemma-4B fine-tuned on the complete MultiMedQA dataset shows slow convergence with both training and validation losses plateauing above 1.0 after 20,000 steps. Right: MedGemma-4B fine-tuned on 200 NanoFlux-generated examples achieves faster convergence to a lower loss floor ($\sim$0.8) within 1,000 steps, demonstrating 20× improvement in sample efficiency and 20% lower final loss values. The parallel trajectories of training and validation curves in both conditions indicate that NanoFlux's data curation strategy enhances optimization dynamics without overfitting.
  • Figure 5: Visual representation of reasoning quality levels from L1 (lowest) to L5 (highest).
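
The Figure 1 pipeline (novelty filtering, judge validation, defender attempt, retention of unsolved questions) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and not the authors' implementation: `toy_embed` is a bag-of-words stand-in for whatever embedding model the paper uses, the `max_similarity` threshold of 0.8 is arbitrary, and `judge_validates` and `defender_solves` are hypothetical callables standing in for the Judge and Defender models. The sketch also omits the "solves through novel approaches" retention branch mentioned in the caption.

```python
import math
from collections import Counter
from typing import Callable, List


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words counts (NOT the paper's embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def novelty_filter(
    candidates: List[str],
    max_similarity: float = 0.8,  # hypothetical threshold, not from the paper
    embed: Callable[[str], Counter] = toy_embed,
) -> List[str]:
    """Keep a candidate only if it is not too similar to any already-kept one."""
    kept: List[str] = []
    kept_vecs: List[Counter] = []
    for question in candidates:
        vec = embed(question)
        if all(cosine(vec, other) < max_similarity for other in kept_vecs):
            kept.append(question)
            kept_vecs.append(vec)
    return kept


def build_dataset(
    candidates: List[str],
    judge_validates: Callable[[str], bool],  # hypothetical Judge stand-in
    defender_solves: Callable[[str], bool],  # hypothetical Defender stand-in
    max_size: int = 200,  # per-domain cap stated in the Figure 1 caption
) -> List[str]:
    """Retain novel, judge-validated questions that the defender fails to solve."""
    dataset: List[str] = []
    for question in novelty_filter(candidates):
        if len(dataset) >= max_size:
            break
        if not judge_validates(question):
            continue  # the judge rejected the question as invalid
        if not defender_solves(question):
            dataset.append(question)  # a defender failure is a training signal
    return dataset
```

In this sketch the novelty filter runs before the judge and defender calls, matching the order in the caption, so near-duplicate questions are discarded before any model is invoked on them.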