An Incomplete Loop: Instruction Inference, Instruction Following, and In-context Learning in Language Models

Emmy Liu; Graham Neubig; Jacob Andreas

TL;DR

The paper investigates how three forms of reasoning (deductive instruction following, inductive in-context learning via few-shot prompting, and abductive instruction inference) interact in large language models. Evaluating GPT-3.5, GPT-4, and Llama-2 models on linear functions, a simple artificial language, and Kalamang translation, it shows that instruction inference can outperform few-shot prompting on simple synthetic tasks but does not consistently generalize to complex, real-world translation. A key finding is that a model's ability to induce instructions correlates weakly or not at all with its in-context learning performance, which suggests that these capabilities rely on distinct mechanisms. The work highlights the non-systematic nature of reasoning in current LMs and motivates future work on validating hypotheses and on jointly training these capabilities to achieve more robust autonomous learning and self-improvement.

Abstract

Modern language models (LMs) can learn to perform new tasks in different ways: in instruction following, the target task is described explicitly in natural language; in few-shot prompting, the task is specified implicitly with a small number of examples; in instruction inference, LMs are presented with in-context examples and are then prompted to generate a natural language task description before making predictions. Each of these procedures may be thought of as invoking a different form of reasoning: instruction following involves deductive reasoning, few-shot prompting involves inductive reasoning, and instruction inference involves abductive reasoning. How do these different capabilities relate? Across four LMs (from the GPT and Llama families) and two learning problems (involving arithmetic functions and machine translation) we find a strong dissociation between the different types of reasoning: LMs can sometimes learn effectively from few-shot prompts even when they are unable to explain their own prediction rules; conversely, they sometimes infer useful task descriptions while completely failing to learn from human-generated descriptions of the same task. Our results highlight the non-systematic nature of reasoning even in some of today's largest LMs, and underscore the fact that very different learning mechanisms may be invoked by seemingly similar prompting procedures.
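The three procedures named in the abstract differ mainly in what the prompt contains. The following is a minimal sketch of that difference for a linear-function task. The prompt wording, the function names, and the example values are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Tuple

Example = Tuple[int, int]  # (input x, output y); hypothetical data format


def make_examples(a: int, b: int, xs: List[int]) -> List[Example]:
    """Generate (x, y) pairs from the linear function y = a*x + b."""
    return [(x, a * x + b) for x in xs]


def format_shots(examples: List[Example]) -> str:
    """Render examples as input/output pairs separated by blank lines."""
    return "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)


def few_shot_prompt(examples: List[Example], query: int) -> str:
    """Inductive setting: the task is specified only through examples."""
    return f"{format_shots(examples)}\n\nInput: {query}\nOutput:"


def instruction_following_prompt(instruction: str, query: int) -> str:
    """Deductive setting: the task is described explicitly in words."""
    return f"Instruction: {instruction}\n\nInput: {query}\nOutput:"


def instruction_inference_prompt(examples: List[Example]) -> str:
    """Abductive setting: ask the model to state the rule behind the examples.

    The model's answer could then be passed to instruction_following_prompt
    as the instruction before making predictions.
    """
    return f"{format_shots(examples)}\n\nDescribe the rule that maps each input to its output:"


if __name__ == "__main__":
    shots = make_examples(a=2, b=3, xs=[1, 2])
    print(few_shot_prompt(shots, query=4))
    print(instruction_following_prompt("Multiply the input by 2 and add 3.", query=4))
    print(instruction_inference_prompt(shots))
```

In this sketch the instruction-inference prompt asks only for a description; a full pipeline would feed the generated description back into an instruction-following prompt, which is where the two capabilities would need to agree.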

Paper Structure (29 sections, 2 equations, 7 figures, 9 tables)

Introduction
Three Types of Reasoning in Language Models
Instruction Following
Few-Shot Prompting
Instruction Inference
Methods
Domains and Evaluation
Linear Functions
Simple Artificial Languages
Kalamang Translation
Results
When Does Instruction Inference Improve Over In-Context Learning?
How Does the Ability to Induce Instructions Relate to In-Context Learning?
Related Work
Conclusion
...and 14 more sections

Figures (7)

Figure 1: Diagram of abductive reasoning for an LM. Red arrows show data flow in inductive reasoning (few-shot prompting), while blue arrows show data flow in deductive reasoning (instruction following). Black arrows indicate data flow unique to abductive reasoning (instruction induction). Instruction inference generally improves on few-shot prompting and zero-shot chain of thought. However, success at inductive reasoning and success at instruction inference are not related.
Figure 2: Real coefficients of linear functions and relationship to hypothesized coefficients for GPT-3.5-turbo and GPT-4-turbo. Remaining models can be found in the appendix. The x-axis has been truncated for visualization purposes (as there are some large outlier hypotheses). GPT-4-turbo is able to induce a reasonable function in-context, but other models struggle.
Figure 3: Accuracy of models in synthetic domains with and without hypothesis generation. Error bars indicate standard error. The top row shows results for the functions domain, while the bottom row shows results for the colours domain. Results are aggregated across 6 runs, and zero values are marked with '0'.
Figure 4: chrF scores for Kalamang under different methods, in the English-to-Kalamang direction (top row) and the Kalamang-to-English direction (bottom row).
Figure 6: Model predictions plotted against true function output for all models. The plotted range is restricted to [-400, 400] for visualization purposes, although there are large outlier values for all models.
...and 2 more figures
