FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents

Qizheng Li, Yifei Zhang, Xiao Yang, Xu Yang, Zhuo Wang, Weiqing Liu, Jiang Bian

TL;DR

FT-Agent is an autonomous system that mirrors human experts: it uses evaluation-driven feedback to iteratively diagnose failures and refine fine-tuning strategies, and the approach generalizes effectively to 3B models.

Abstract

Fine-tuning large language models for vertical domains remains a labor-intensive and expensive process, requiring domain experts to curate data, configure training, and iteratively diagnose model behavior. Despite growing interest in autonomous machine learning, no prior work has tackled end-to-end LLM fine-tuning with agents. Can LLM-based agents automate this complete process? We frame this as a substantially open problem: agents must navigate an open-ended search space spanning data curation from diverse data sources, processing with complex tools, building a training pipeline, and iteratively refining their approach based on evaluation outcomes in rapidly growing logs--an overall scenario far more intricate than existing benchmarks. To study this question, we introduce FT-Dojo, an interactive environment comprising 13 tasks across 5 domains. We further develop FT-Agent, an autonomous system that mirrors human experts by leveraging evaluation-driven feedback to iteratively diagnose failures and refine fine-tuning strategies. Experiments on FT-Dojo demonstrate that purpose-built fine-tuning agents significantly outperform general-purpose alternatives, with FT-Agent achieving the best performance on 10 out of 13 tasks across all five domains. Ablations show that the approach generalizes effectively to 3B models, with additional insights on data scaling trade-offs and backbone sensitivity. Case analyses reveal that agents can recover from failures through cumulative learning from historical experience, while also exposing fundamental limitations in causal reasoning--highlighting both the promise and current boundaries of autonomous LLM fine-tuning.

Paper Structure (67 sections, 4 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Overview of FT-Dojo architecture and agent interaction workflow. Agents operate within an isolated sandbox and interact with three core interfaces: Meta API $\mathcal{I}$ for task and system information, Data Repository $\mathcal{D}$ for training data, and Evaluator $\mathcal{F}$ that returns structured feedback. Agents iteratively query information, submit code and configurations, execute training, and refine based on evaluation feedback until achieving satisfactory performance.
  • Figure 2: Overview of FT-Agent. The agent iteratively explores configurations, learning from failed trials (red) to reach successful ones (green). Each iteration follows three stages: Strategy Proposal, Implementation & Validation, and Feedback Analysis.
  • Figure 3: Ablation study results. (a) Data scaling: Performance comparison between 2k and 5k training samples. (b) Agent backbone: Impact of different LLM backbones on agent performance. (c) Model scale: Performance evaluation across varying model sizes. Full numerical results are provided in Appendix Table \ref{tab:ablation-val-test}.
  • Figure 4: Contrasting Learning Trajectories. (a) On Chemistry, the agent demonstrates cumulative learning, recovering from failure and progressively refining its approach through domain tool integration and iterative optimization. (b) On Patent Classification, it exhibits shotgun debugging, cycling through advanced techniques without identifying the root cause of overfitting.
  • Figure 5: Cross-domain data strategies autonomously discovered by FT-Agent. Left: for Numerical Reasoning, the agent mixes TableInstruct (90%) with out-of-domain DeepScaleR (10%). Right: for Mol_Opt, the agent combines three chemistry sub-tasks with task-informed proportions.
  • ...and 1 more figure
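
The iteration that the captions for Figures 1 and 2 describe (propose a strategy, implement and validate it, analyze the feedback, repeat until performance is satisfactory) can be sketched as a small loop. This is a minimal illustration under stated assumptions, not FT-Agent's actual implementation: `Trial`, `History`, `run_loop`, `toy_propose`, and `toy_evaluate` are hypothetical names, and the toy evaluator merely stands in for a real training run scored by the Evaluator $\mathcal{F}$.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Trial:
    """One iteration: the proposed config, its score, and a short diagnosis."""
    config: Dict[str, int]
    score: float
    diagnosis: str


@dataclass
class History:
    """Accumulated trials; the agent conditions each new proposal on these."""
    trials: List[Trial] = field(default_factory=list)

    def best(self) -> Optional[Trial]:
        # Returns None when no trial has been recorded yet.
        return max(self.trials, key=lambda t: t.score, default=None)


def run_loop(
    propose: Callable[[History], Dict[str, int]],
    train_and_evaluate: Callable[[Dict[str, int]], float],
    target: float,
    max_iters: int,
) -> History:
    """Iterate propose -> train/evaluate -> analyze until the target is met."""
    history = History()
    for _ in range(max_iters):
        config = propose(history)              # Stage 1: Strategy Proposal
        score = train_and_evaluate(config)     # Stage 2: Implementation & Validation
        best = history.best()                  # Stage 3: Feedback Analysis
        if best is None:
            diagnosis = "baseline"
        elif score > best.score:
            diagnosis = "improved"
        else:
            diagnosis = "regressed"
        history.trials.append(Trial(config, score, diagnosis))
        if score >= target:
            break
    return history


def toy_propose(history: History) -> Dict[str, int]:
    # Hypothetical strategy: lower the learning-rate exponent by one per trial.
    return {"lr_exponent": -1 - len(history.trials)}


def toy_evaluate(config: Dict[str, int]) -> float:
    # Hypothetical stand-in for training plus evaluation: peaks at 1e-4.
    return 1.0 - abs(config["lr_exponent"] + 4) * 0.25


history = run_loop(toy_propose, toy_evaluate, target=0.9, max_iters=10)
for trial in history.trials:
    print(trial.config, trial.score, trial.diagnosis)
```

Keeping the full `History` rather than only the best trial is the design choice that matters here: it is what lets a proposer reason about earlier failures, which the Figure 4 caption calls cumulative learning. A real proposer would be an LLM call over that history instead of a fixed schedule.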