LLMs as Function Approximators: Terminology, Taxonomy, and Questions for Evaluation

David Schlangen

TL;DR

The paper reframes LLMs as function approximators rather than agents with general intelligence, to clarify evaluation goals. It introduces a formal notion of prompt-induced functions and a taxonomy of function types, along with a structured evaluation framework spanning the prompt, induction, and function-space aspects. Illustrative examples such as HELM templates, MT-Bench, and system prompts demonstrate how this framing maps practical prompts to semantic functions. The discussion also highlights risks like prompt injection and the importance of safeguarding and controlling the induced functions. Overall, the work provides a principled language and checklist for evaluating what LLMs can reliably approximate and how to design prompts and controls to steer their behavior.

Abstract

Natural Language Processing has moved rather quickly from modelling specific tasks to taking more general pre-trained models and fine-tuning them for specific tasks, to a point where we now have what appear to be inherently generalist models. This paper argues that the resultant loss of clarity on what these models model leads to metaphors like "artificial general intelligences" that are not helpful for evaluating their strengths and weaknesses. The proposal is to see their generality, and their potential value, in their ability to approximate specialist functions, based on a natural language specification. This framing brings to the fore questions of the quality of the approximation, but beyond that, also questions of discoverability, stability, and protectability of these functions. As the paper will show, this framing hence brings together in one conceptual framework various aspects of evaluation, both from a practical and a theoretical perspective, as well as questions often relegated to a secondary status (such as "prompt injection" and "jailbreaking").

Paper Structure (11 sections, 3 figures)

Figures (3)

  • Figure 1: Figure 23 from HELM (helm2023), showing the prompt template for a multiple-choice question task.
  • Figure 2: The informational components in MT-Bench example humanities-151 (zheng-et-al-chatbot-arena-2023).
  • Figure 3: Functions (with restricted parts of domain and co-domain), with the task descriptions that induce them, and possible systematic relations between functions and task descriptions. In the background lurks an undesirable function that is not to be induced. Size of the surrounding function space $\mathcal{F}$ not to scale.

Theorems & Definitions (2)

  • Definition 1
  • Definition 2