Look Before You Leap: Towards Decision-Aware and Generalizable Tool-Usage for Large Language Models

Anchun Gui; Jian Li; Yong Dai; Nan Du; Han Xiao

TL;DR

DEER is a decision-aware and generalizable tool-usage framework. It first constructs tool-usage samples with multiple decision branches via an automatic generation pipeline, which fosters the decision-making awareness of LLMs under diverse scenarios. It also introduces a novel tool sampling strategy to improve the generalizability of LLMs to unseen tools.

Abstract

Tool-augmented large language models (LLMs) are attracting widespread attention as a way to access up-to-date knowledge and alleviate hallucination. Advanced closed-source LLMs (e.g., ChatGPT) have demonstrated surprising tool-usage capabilities through prompting and in-context learning. To give open-source LLMs (e.g., LLaMA) comparable ability to manipulate tools, current efforts focus on either template-driven or token-triggered tool-usage. However, the former hampers an LLM's flexibility in addressing diverse user queries because tool interactions are constrained, while the latter limits generalizability to new tools because tool-usage learning is based on task- and tool-specific datasets. To alleviate these concerns, we propose a decision-aware and generalizable tool-usage framework (DEER). Specifically, we first construct tool-usage samples with multiple decision branches via an automatic generation pipeline, thereby fostering the decision-making awareness of LLMs under diverse scenarios. We also propose a novel tool sampling strategy to enhance the generalizability of LLMs to unseen tools. Extensive experiments demonstrate that DEER is effective and significantly outperforms baselines across various datasets.
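To make the decision-aware idea concrete, here is a minimal sketch of the two decisions the framework is described as making: whether to search for tools at all, and whether any tool in the candidate toolset is suitable. It is an illustration under stated assumptions, not the authors' implementation. In DEER these decisions are produced by the fine-tuned LLM itself; the names `Outcome`, `Tool`, `decide`, `toy_needs_tool`, and `toy_is_suitable` are hypothetical, the keyword heuristics are stand-ins for the model, and the three outcomes below are not claimed to match the paper's four numbered branches.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence, Tuple


class Outcome(Enum):
    ANSWER_DIRECTLY = "answer_directly"    # no tool search is needed
    NO_SUITABLE_TOOL = "no_suitable_tool"  # searched, but nothing in the toolset fits
    CALL_TOOL = "call_tool"                # searched, and a candidate tool fits


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


def decide(
    query: str,
    candidate_tools: Sequence[Tool],
    needs_tool: Callable[[str], bool],
    is_suitable: Callable[[str, Tool], bool],
) -> Tuple[Outcome, Optional[Tool]]:
    """Apply the two decisions in order and return the outcome and chosen tool."""
    # Decision 1: should we look for a tool at all?
    if not needs_tool(query):
        return Outcome.ANSWER_DIRECTLY, None
    # Decision 2: does any tool in the candidate toolset fit the query?
    for tool in candidate_tools:
        if is_suitable(query, tool):
            return Outcome.CALL_TOOL, tool
    return Outcome.NO_SUITABLE_TOOL, None


# Hypothetical stand-ins for the model's judgments (assumptions, illustration only).
def toy_needs_tool(query: str) -> bool:
    return "weather" in query.lower()


def toy_is_suitable(query: str, tool: Tool) -> bool:
    query_words = set(query.lower().split())
    tool_words = set(tool.description.lower().split())
    return bool(query_words & tool_words)
```

A caller would pass the current candidate toolset together with the two judgment functions, for example `decide("what is the weather in Paris", tools, toy_needs_tool, toy_is_suitable)`. Keeping the judgments as parameters makes it easy to swap the toy heuristics for model calls.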

Paper Structure (26 sections, 2 equations, 18 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Tool-Augmented Language Models
Tool-Usage with Open-Source LLMs
Methodology
Multi-Decision Branches Design
Automatic Generation Pipeline.
Tool Sampling
Experiments
Setup
Datasets & Evaluation Protocols.
Baselines.
Implementation.
Main Results
The decision awareness of tool-usage.
...and 11 more sections

Figures (18)

Figure 1: Given diverse user queries, LLMs are expected to make optimal decisions that reduce unnecessary tool-usage and accelerate inference.
Figure 2: Comparison of different tool-usage paradigms. (a) In template-driven tool-usage, the model interacts with tools for every query, repeating a specific format (e.g., "Thought-Action-Observation") until it obtains the final answer. (b) In token-triggered tool-usage, a tool is triggered when the model generates a specific token (e.g., "->") during inference. (c) In our decision-aware tool-usage, we design multiple decision branches (i.e., ①, ②, ③, ④) to address diverse user queries, deciding whether to search for tools and whether suitable tools are available. Note that the candidate toolset is the set of tools currently provided to the model.
Figure 3: Pipeline of our sample generation. Detailed descriptions are presented in Section \ref{subsec:multi_decision}. Note that here we simplify the prompt templates and contexts for clarity, and full prompt templates are provided in Appendix \ref{app:template}.
Figure 4: Illustration of different sampling strategies.
Figure 5: Comparison of tool sampling strategies (first row) and sampling ratios (second row, where the $x$-axis gives the ratio "Random : Intra-class : Inter-class"). The first, second, and third columns show experiments on the NoCall, Call, and Decision-Call scenarios (i.e., $\mathrm{P}_{\mathrm{NoCall}}$, $\mathrm{P}_{\mathrm{Call}}$, $\mathrm{P}_{\mathrm{DC}}$), respectively. Note that the valid(ation) and test sets are built on seen and unseen tools, respectively.
...and 13 more figures
