LLM Agents Making Agent Tools

Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelović, Jakob Nikolas Kather

TL;DR

ToolMaker introduces an autonomous framework that converts scientific papers and their code repositories into LLM-compatible tools within a reproducible, Docker-based environment. It combines environment setup, tool implementation, and a closed-loop self-correction cycle to build multi-step, dependency-rich tools with minimal human intervention. Evaluated on TM-Bench, ToolMaker correctly implements 80% of the 15 tasks, substantially surpassing the OpenHands baseline and demonstrating the feasibility of fully autonomous, tool-enabled scientific workflows. The work also provides TM-Bench as a public benchmark to accelerate progress in agentic tool creation, while acknowledging safety and reproducibility considerations inherent to automated code and model deployment.

Abstract

Tool use has turned large language models (LLMs) into powerful agents that can perform complex multi-step tasks by dynamically utilising external software components. However, these tools must be implemented in advance by human developers, hindering the applicability of LLM agents in domains demanding large numbers of highly specialised tools, like in life sciences and medicine. Motivated by the growing trend of scientific studies accompanied by public code repositories, we propose ToolMaker, an agentic framework that autonomously transforms papers with code into LLM-compatible tools. Given a GitHub URL and short task description, ToolMaker autonomously installs dependencies and generates code to perform the task, using a closed-loop self-correction mechanism for debugging. To evaluate our approach, we introduce a benchmark comprising 15 complex computational tasks spanning various domains with over 100 unit tests to assess correctness and robustness. Our method correctly implements 80% of the tasks, substantially outperforming current state-of-the-art software engineering agents. ToolMaker therefore is a step towards fully autonomous agent-based scientific workflows. Our code and benchmark are publicly available at https://github.com/KatherLab/ToolMaker.

Paper Structure

This paper contains 51 sections, 11 equations, 5 figures, 13 tables, 1 algorithm.

Figures (5)

  • Figure 1: We envision a future where agents possess dynamic toolsets that can be expanded at runtime. Tool creation, studied here, is a crucial step towards this goal.
  • Figure 2: Given a task description, a scientific paper, a link to the associated code repository, and an example of the tool invocation, ToolMaker creates (i) a Docker container in which the tool can be executed, (ii) a Python function that performs the task.
  • Figure 3: ToolMaker workflow. Given a task description, a scientific paper, and its associated code repository, ToolMaker generates an executable tool that enables a downstream agent to perform the described task.
  • Figure 4: An agent uses a tool-augmented LLM to perform a specific sub-task, and returns the result. Messages are appended to the conversation history, and tool calls enable the agent to interact with the environment.
  • Figure 5: Transitions between tool calls by ToolMaker.
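
The closed-loop self-correction mentioned in the abstract can be illustrated with a minimal sketch. This is not ToolMaker's implementation: the names `self_correct`, `Attempt`, `toy_generate`, and `toy_run` are hypothetical, and the "generator" below is a hard-coded stand-in for an LLM call. It only shows the general pattern of generating a candidate, executing it, and feeding the failure message back into the next attempt.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    """One generate-and-run cycle (hypothetical record type)."""
    code: str
    ok: bool
    feedback: str


def self_correct(
    generate: Callable[[List[Attempt]], str],
    run: Callable[[str], Tuple[bool, str]],
    max_iters: int = 3,
) -> Tuple[Optional[str], List[Attempt]]:
    """Generate code, run it, and retry with feedback until it passes.

    Returns the first passing candidate (or None) plus the full history.
    """
    history: List[Attempt] = []
    for _ in range(max_iters):
        code = generate(history)       # the generator sees earlier failures
        ok, feedback = run(code)       # execute the candidate and check it
        history.append(Attempt(code, ok, feedback))
        if ok:
            return code, history
    return None, history


def toy_generate(history: List[Attempt]) -> str:
    """Stand-in for an LLM: the first draft is buggy, later drafts are fixed."""
    if not history:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def toy_run(code: str) -> Tuple[bool, str]:
    """Execute the candidate and check a single expected result."""
    namespace: dict = {}
    try:
        exec(code, namespace)          # acceptable only for this toy example
        result = namespace["add"](2, 3)
    except Exception as exc:           # any error becomes feedback
        return False, repr(exc)
    if result != 5:
        return False, f"expected 5, got {result}"
    return True, "ok"


if __name__ == "__main__":
    final_code, attempts = self_correct(toy_generate, toy_run)
    for i, attempt in enumerate(attempts, start=1):
        print(i, attempt.ok, attempt.feedback)
```

In a real system the `run` step would execute inside an isolated environment (the figures describe a Docker container) rather than via `exec` in the host process, and the feedback would be passed back to the model as part of the conversation history.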