Towards Practical Tool Usage for Continually Learning LLMs

Jerry Huang; Prasanna Parthasarathi; Mehdi Rezagholizadeh; Sarath Chandar

TL;DR

This work tackles continual learning for LLMs by integrating tool usage to offset non-stationary knowledge and task distributions. It introduces a synthetic arithmetic benchmark and a GLUE-based continual learning setup to compare tool-enabled and vanilla models across a wide size range, showing that simply scaling parameters does not solve forgetting and that replay buffers with tool use markedly improve both learning speed and retention. The results indicate that tool LLMs can achieve higher learning accuracy with smaller models, but forgetting remains an obstacle without replay, highlighting the need to manage tool non-stationarity and imperfect tools. The discussion connects parametric knowledge utilization, auxiliary tool systems, and the dual nature of forgetting, underscoring practical implications for deploying efficient, continual learners in dynamic real-world environments.

Abstract

Large language models (LLMs) show an innate skill for solving language-based tasks. However, prior findings suggest that they cannot adjust when information or task-solving skills become outdated, because their knowledge, stored directly within their parameters, remains static over time. Tool use helps by offloading work to systems that the LLM can access through an interface, but LLMs that use tools must still adapt to non-stationary environments for prolonged use, as new tools can emerge and existing tools can change. Nevertheless, tools require less specialized knowledge. We therefore hypothesize that tool-using LLMs are better suited for continual learning (CL), as they rely less on parametric memory for solving tasks and instead focus on learning when to apply pre-defined tools. To verify this, we develop a synthetic benchmark and then aggregate existing NLP tasks to form a more realistic testing scenario. While we demonstrate that scaling model size is not a solution, regardless of tool usage, continual learning techniques can enable tool LLMs to both adapt faster and forget less, highlighting their potential as continual learners.
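To make the distinction between answering from parametric memory and learning when to apply a pre-defined tool concrete, here is a minimal sketch of how a generated tool call might be parsed, executed, and scored. The call syntax (`add(3, 5)`), the tool names, and the helper functions are hypothetical illustrations, not the format used in the paper.

```python
import operator
import re

# Hypothetical tool registry; the paper's actual tools and call syntax may differ.
TOOLS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
}

# Matches strings such as "add(3, 5)" with two integer arguments.
CALL_PATTERN = re.compile(r"^\s*(\w+)\(\s*(-?\d+)\s*,\s*(-?\d+)\s*\)\s*$")


def execute_tool_call(text):
    """Parse a generated tool call and run the named tool.

    Returns None when the string is malformed or names an unknown tool.
    """
    match = CALL_PATTERN.match(text)
    if match is None:
        return None
    name, left, right = match.groups()
    tool = TOOLS.get(name)
    if tool is None:
        return None
    return tool(int(left), int(right))


def is_correct(generated, ground_truth):
    """Score a generation by comparing the executed result to the ground truth."""
    return execute_tool_call(generated) == ground_truth
```

Under this assumed setup, the model only has to emit the right call; the arithmetic itself is delegated to the tool, so a malformed or unknown call simply counts as an error.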

Paper Structure (55 sections, 8 equations, 7 figures, 11 tables)

Introduction
Related Works
LLMs as Continual Learners.
Efficiently Updating LLMs.
Tool-Augmented LLMs.
Motivating Questions
Methodology
Preliminaries
Model:
Dataset Format:
Learning Setup:
Baselines
Sequential Fine-tuning:
Mixed Dataset:
Episodic Replay (ER):
...and 40 more sections

Figures (7)

Figure 1: CL with Tools - For a task, the model is first trained to predict/generate tool calls, rather than explicit responses. The trained model is then frozen and evaluated, during which it outputs tool calls that are parsed and executed to return an output which is compared against the ground-truth. The model is then unfrozen and trained on the next task in the sequence. This is repeated until all tasks have been learned by the model.
Figure 2: Metrics across the different task setups. Although using tools clearly improves L-Accuracy, Accuracy across tasks does not show the same gain. The significant forgetting with tools is only fixed by appropriate use of a replay buffer, which improves overall accuracy irrespective of task difficulty.
Figure 3: Accuracy on all benchmarks when tasks are mixed (5 seeds). Red bars denote accuracy without tools; grey bars show the gain from using tools. Top labels show accuracy using tools. Tabular versions of the numerical results are available in the appendix.
Figure 4: While scale (when not using tools) plays a significant role in how the model's capacity is used to learn a task, the lack of a similar effect on forgetting suggests that a 13B model is only as good as a 125M model at retaining knowledge of past tasks.
Figure 5: The replay buffer plays a significant role in helping LMs mitigate forgetting across tasks.
...and 2 more figures
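
The train-then-evaluate loop described in the Figure 1 caption, and the metrics named in the other captions (Accuracy, L-Accuracy, forgetting), can be sketched as follows. The metric definitions here follow conventions common in the continual learning literature and are an assumption; the paper's exact definitions may differ. `train_fn` and `eval_fn` are hypothetical callables standing in for model training and for tool-call evaluation.

```python
def run_sequence(tasks, train_fn, eval_fn):
    """Train on tasks in order; after each task, evaluate on every task seen so far.

    Returns acc, where acc[i][j] is the accuracy on task j after training
    on task i (defined for j <= i).
    """
    acc = []
    for i, task in enumerate(tasks):
        train_fn(task)
        acc.append([eval_fn(tasks[j]) for j in range(i + 1)])
    return acc


def learning_accuracy(acc):
    """Mean accuracy on each task measured right after training on it."""
    return sum(acc[i][i] for i in range(len(acc))) / len(acc)


def final_accuracy(acc):
    """Mean accuracy over all tasks after the last task has been learned."""
    last = acc[-1]
    return sum(last) / len(last)


def forgetting(acc):
    """Mean drop from each earlier task's best accuracy to its final accuracy."""
    n = len(acc)
    if n < 2:
        return 0.0
    drops = []
    for j in range(n - 1):
        best = max(acc[i][j] for i in range(j, n - 1))
        drops.append(best - acc[-1][j])
    return sum(drops) / len(drops)
```

With these assumed definitions, a model can have high learning accuracy and still low final accuracy: each task is learned well when it is trained on, but earlier tasks degrade afterwards, which is the pattern the captions attribute to tool LLMs trained without a replay buffer.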
