
TinyAgent: Function Calling at the Edge

Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Hooper, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami

TL;DR

The paper tackles edge deployment of function-calling agents by fine-tuning small language models (TinyAgent-1.1B and 7B) on a curated, high-quality dataset so that they can perform task planning and tool orchestration. It combines an LLMCompiler-based function-calling planner, a Tool RAG retrieval mechanism that shortens the prompt, and 4-bit quantization to enable fast, private, on-device inference on macOS. On the paper's driving application, a local Siri-like MacBook assistant, the TinyAgent models match or surpass the function-calling success of GPT-4-Turbo while running entirely locally. The authors release the dataset, models, and an installable package. Overall, the work demonstrates a practical pipeline for building privacy-preserving edge agents from open-source components that can be deployed on consumer hardware.

Abstract

Recent large language models (LLMs) have enabled the development of advanced agentic systems that can integrate various tools and APIs to fulfill user queries through function calling. However, the deployment of these LLMs on the edge has not been explored since they typically require cloud-based infrastructure due to their substantial model size and computational demands. To this end, we present TinyAgent, an end-to-end framework for training and deploying task-specific small language model agents capable of function calling for driving agentic systems at the edge. We first show how to enable accurate function calling for open-source models via the LLMCompiler framework. We then systematically curate a high-quality dataset for function calling, which we use to fine-tune two small language models, TinyAgent-1.1B and 7B. For efficient inference, we introduce a novel tool retrieval method to reduce the input prompt length and utilize quantization to further accelerate the inference speed. As a driving application, we demonstrate a local Siri-like system for Apple's MacBook that can execute user commands through text or voice input. Our results show that our models can achieve, and even surpass, the function-calling capabilities of larger models like GPT-4-Turbo, while being fully deployed at the edge. We open-source our dataset, models, and installable package and provide a demo video for our MacBook assistant agent.

Paper Structure (14 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Overview of the LLMCompiler Function Calling Planner. The Planner understands the user query and generates a sequence of tasks with their inter-dependencies. These tasks are then dispatched by the LLMCompiler framework to accomplish the user command. In this example, Tasks $1 and $2 are fetched together to retrieve the email addresses of Sid and Lutfi independently. After each task is performed, the results are forwarded to Task $3, which creates the calendar event. Before executing Task $3, LLMCompiler replaces the placeholder variables (e.g., the variables $1 and $2 in Task $3) with actual values.
  • Figure 2: Graph Isomorphism Success Rate. The model scores a success rate of 1 only if the DAG of its generated plan is isomorphic to the DAG of the ground truth plan, and 0 otherwise. In the top case of the example, the order of the get_email_address calls differs from the ground truth plan (the ground truth plan gets the email address of Lutfi before Sid, and the generated plan gets the email address of Sid before Lutfi), but since the two DAGs are isomorphic to each other, the plan scores a success rate of 1. In the bottom case, the predicted DAG contains a wrong node, corresponding to a wrong function call, so the plan scores a success rate of 0.
  • Figure 3: Overview of our Tool RAG scheme. We formulate tool retrieval as a multi-label classification problem. The user query is given as input to the fine-tuned DeBERTa-v3-small model, which outputs a 16-dimensional vector indicating tool probabilities. Tools with probabilities higher than 50% are selected, averaging 3.97 tools per query compared to 6 tools in basic RAG.
  • Figure 4: Efficient tool selection based on a user input. Not all user inputs require all available tools; hence, it is imperative to select the right set of tools to minimize the prompt size and increase performance. In this case, the LLM only needs the functions that get email addresses and create a calendar event to accomplish its task.
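
The mechanics described in the captions above can be illustrated with a short Python sketch. It is not the authors' implementation. It assumes a plan is a list of `(task_id, tool_name, args)` tuples, that `$N` inside an argument refers to the result of task `N`, and that two DAG nodes match when their tool names match (arguments are ignored). The helper names, the tool names other than `get_email_address`, and the example email addresses are hypothetical. Isomorphism is checked by brute force over node permutations, which is only practical for the small plans shown here.

```python
import re
from itertools import permutations

PLACEHOLDER = re.compile(r"\$(\d+)")  # matches $1, $2, ... inside an argument


def select_tools(probs, threshold=0.5):
    """Keep tools whose predicted probability exceeds the threshold (Figure 3)."""
    return [name for name, p in probs.items() if p > threshold]


def plan_to_dag(plan):
    """Turn a plan into (node labels, dependency edges) using $N placeholders."""
    labels = {tid: tool for tid, tool, _ in plan}
    edges = set()
    for tid, _, args in plan:
        for arg in args:
            for dep in PLACEHOLDER.findall(str(arg)):
                edges.add((int(dep), tid))  # task `dep` must finish before `tid`
    return labels, edges


def substitute(args, results):
    """Replace each $N placeholder with the stored result of task N (Figure 1)."""
    return [PLACEHOLDER.sub(lambda m: str(results[int(m.group(1))]), str(a))
            for a in args]


def isomorphic(plan_a, plan_b):
    """True if the two plan DAGs are isomorphic with matching tool names (Figure 2)."""
    labels_a, edges_a = plan_to_dag(plan_a)
    labels_b, edges_b = plan_to_dag(plan_b)
    if sorted(labels_a.values()) != sorted(labels_b.values()):
        return False
    if len(edges_a) != len(edges_b):
        return False
    ids_a = list(labels_a)
    for perm in permutations(labels_b):  # brute force: small plans only
        mapping = dict(zip(ids_a, perm))
        if any(labels_a[a] != labels_b[b] for a, b in mapping.items()):
            continue
        if {(mapping[u], mapping[v]) for u, v in edges_a} == edges_b:
            return True
    return False


# Example mirroring the captions: the same plan with the two lookups swapped.
ground_truth = [
    (1, "get_email_address", ["Lutfi"]),
    (2, "get_email_address", ["Sid"]),
    (3, "create_calendar_event", ["Meeting", "$1", "$2"]),
]
generated = [
    (1, "get_email_address", ["Sid"]),
    (2, "get_email_address", ["Lutfi"]),
    (3, "create_calendar_event", ["Meeting", "$1", "$2"]),
]
success = 1 if isomorphic(ground_truth, generated) else 0
```

Here `success` is 1 because swapping the two independent lookups leaves the dependency structure unchanged. Replacing one of the tool names in `generated` with a different function would make the label check fail and give 0. A production version would more likely rely on a graph library than on permutation search.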