Language hooks: a modular framework for augmenting LLM reasoning that decouples tool usage from the model and its prompt

Damien de Mijolla, Wen Yang, Philippa Duckett, Christopher Frye, Mark Worrall

TL;DR

Language hooks present a modular, task- and model-agnostic framework that interleaves base-model text generation with conditional program execution (hooks) to augment reasoning and tool usage. Hooks are defined as small, composable programs with triggers and eligibility checks, enabling capabilities like arithmetic validation, knowledge retrieval, and output guardrails without fine-tuning the base model. Empirical results across mathematical reasoning, multi-hop QA, and composite tasks show competitive performance with both general prompting baselines (CoT, ReAct) and task-aware methods (PAL, DSP), while preserving generalisability and enabling external validation. The approach offers a flexible, transparent pathway to extend LLM capabilities with reduced coupling to prompts and models, with potential applications in safety, verifiability, and modular tool integration.

Abstract

Prompting and fine-tuning have emerged as two competing paradigms for augmenting language models with new capabilities, such as the use of tools. Prompting approaches are quick to set up but rely on providing explicit demonstrations of each tool's usage in the model's prompt, thus coupling tool use to the task at hand and limiting generalisation. Fine-tuning removes the need for task-specific demonstrations of tool usage at runtime; however, it ties new capabilities to a single model, so its already heavier setup costs become a recurring expense. In this paper, we introduce language hooks, a novel framework for augmenting language models with new capabilities that is decoupled both from the model's task-specific prompt and from the model itself. The language hook algorithm interleaves text generation by the base model with the execution of modular programs that trigger conditionally based on the existing text and the available capabilities. Upon triggering, programs may call external tools and auxiliary language models (e.g. using tool-specific prompts) and modify the existing context. We benchmark our method against state-of-the-art baselines, find that it outperforms task-aware approaches, and demonstrate its ability to generalise to novel tasks.
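
The interleaving of generation and hook execution described in the abstract can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Hook` dataclass, `run_with_hooks`, the scripted stand-in for the base model, and the toy calculator hook are all hypothetical names introduced here, and the trigger is a simple rule rather than a learned or model-based one.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hook:
    """Hypothetical hook: an eligibility check, a trigger, and a program."""
    name: str
    eligible: Callable[[str], bool]   # may this hook run at all on this context?
    trigger: Callable[[str], bool]    # is running the program worthwhile right now?
    program: Callable[[str], str]     # returns the (possibly modified) context


def run_with_hooks(
    prompt: str,
    generate_sentence: Callable[[str], Optional[str]],
    hooks: List[Hook],
    max_steps: int = 50,
) -> str:
    """Alternate between hook execution and sentence-level generation."""
    context = prompt
    for _ in range(max_steps):
        fired = next(
            (h for h in hooks if h.eligible(context) and h.trigger(context)), None
        )
        if fired is not None:
            context = fired.program(context)  # a hook may rewrite the context
            continue
        sentence = generate_sentence(context)
        if sentence is None:  # stopping condition: the model has nothing to add
            break
        context += sentence
    return context


# Toy calculator hook (assumption): detect and repair wrong "a + b = c" claims.
SUM = re.compile(r"(\d+) \+ (\d+) = (\d+)")


def has_wrong_sum(context: str) -> bool:
    return any(int(a) + int(b) != int(c) for a, b, c in SUM.findall(context))


def fix_sums(context: str) -> str:
    return SUM.sub(lambda m: f"{m[1]} + {m[2]} = {int(m[1]) + int(m[2])}", context)


def make_scripted_model(sentences: List[str]) -> Callable[[str], Optional[str]]:
    """Stand-in for a base model: emits fixed sentences, then None."""
    it = iter(sentences)
    return lambda context: next(it, None)


calculator = Hook("calculator", lambda c: True, has_wrong_sum, fix_sums)
model = make_scripted_model([" First, 2 + 3 = 6.", " So the answer is 5."])
result = run_with_hooks("Q: What is 2 + 3?", model, [calculator])
print(result)
```

Because the hook only inspects and edits the running context, the same `calculator` object could be paired with any `generate_sentence` callable, which is the decoupling from model and prompt that the abstract emphasises.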

Paper Structure

This paper contains 38 sections, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: Schematic of the language hook algorithm in a toy example. The base model generates text sentence by sentence. Each language hook has a trigger that monitors the advantage of running its program on the existing context. When a hook triggers, its associated program is executed (e.g. knowledge retrieval or calculation check) and modifies (or not) the existing context. When no hook triggers, the base model generates the next sentence of its response, conditioned on the existing context. This iterative process continues until a stopping condition is met.
  • Figure 2: Setup for results in Table \ref{table:ablation-guardrail}. We run the base model with no active hooks to identify $S_1$, those questions which the base model answers, and $S_2$, those which it refuses to answer. Our guardrail hook then defines a subset $S_3 \subseteq S_2$, which the base model now answers but previously did not.
  • Figure 3: F1 score per dataset as we vary the trigger threshold for language hooks. CoT and base trigger rates correspond to results quoted in Tables \ref{table:math-results} and \ref{table:multihop-results}.
  • Figure 4: Trigger probability distributions. The left and right hand columns show the probability distributions for the calculator trigger (light red) and the retriever trigger (blue) respectively. A hook runs its program when $P(\text{trigger}) > \text{threshold}$. We show the base trigger threshold used in our main experiments from Section \ref{experiments} at 0.
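
The thresholding rule in the Figure 4 caption, $P(\text{trigger}) > \text{threshold}$, and the threshold sweep in Figure 3 can be illustrated with a small sketch. The function names `should_trigger` and `trigger_rate` are hypothetical, and the probabilities below are made-up placeholder values chosen only to show the mechanics; they are not results from the paper.

```python
from typing import Sequence


def should_trigger(p_trigger: float, threshold: float) -> bool:
    """A hook runs its program only when P(trigger) exceeds the threshold."""
    return p_trigger > threshold


def trigger_rate(probs: Sequence[float], threshold: float) -> float:
    """Fraction of contexts on which the hook would fire at this threshold."""
    if not probs:
        return 0.0
    return sum(should_trigger(p, threshold) for p in probs) / len(probs)


# Made-up trigger probabilities for four contexts (illustrative only).
toy_probs = [0.05, 0.2, 0.55, 0.9]

# Sweeping the threshold upward can only keep or lower the trigger rate.
rates = {t: trigger_rate(toy_probs, t) for t in (0.1, 0.5, 0.8)}
print(rates)
```

Under this rule a higher threshold makes a hook more conservative, so a sweep of this kind trades how often a hook runs against how often it intervenes unnecessarily.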