An Incomplete Loop: Instruction Inference, Instruction Following, and In-context Learning in Language Models
Emmy Liu, Graham Neubig, Jacob Andreas
TL;DR
The paper investigates how three forms of reasoning interact in large language models: deductive instruction following, inductive in-context learning via few-shot prompting, and abductive instruction inference. Evaluating GPT-3.5/4 and Llama-2 on linear functions, a simple artificial language, and Kalamang translation, it shows that instruction inference can outperform few-shot prompting on simple synthetic tasks but does not consistently generalize to complex, real-world translation. A key finding is that a model's ability to induce instructions correlates weakly or not at all with its in-context learning performance, suggesting that these capabilities rely on distinct mechanisms. The work highlights the non-systematic nature of reasoning in current LMs and motivates future work on validating hypotheses and jointly training models to achieve more robust autonomous learning and self-improvement.
Abstract
Modern language models (LMs) can learn to perform new tasks in different ways: in instruction following, the target task is described explicitly in natural language; in few-shot prompting, the task is specified implicitly with a small number of examples; in instruction inference, LMs are presented with in-context examples and are then prompted to generate a natural language task description before making predictions. Each of these procedures may be thought of as invoking a different form of reasoning: instruction following involves deductive reasoning, few-shot prompting involves inductive reasoning, and instruction inference involves abductive reasoning. How do these different capabilities relate? Across four LMs (from the gpt and llama families) and two learning problems (involving arithmetic functions and machine translation) we find a strong dissociation between the different types of reasoning: LMs can sometimes learn effectively from few-shot prompts even when they are unable to explain their own prediction rules; conversely, they sometimes infer useful task descriptions while completely failing to learn from human-generated descriptions of the same task. Our results highlight the non-systematic nature of reasoning even in some of today's largest LMs, and underscore the fact that very different learning mechanisms may be invoked by seemingly similar prompting procedures.
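To make the three prompting procedures concrete, the following is a minimal sketch of how each prompt might be assembled for a linear-function task. The prompt wording, the function names, and the example function y = 2x + 3 are all illustrative assumptions, not the templates or data used in the paper, and no model is called.

```python
from typing import List, Tuple

Example = Tuple[int, int]


def make_examples(a: int, b: int, xs: List[int]) -> List[Example]:
    """Generate (x, y) pairs from the linear function y = a*x + b."""
    return [(x, a * x + b) for x in xs]


def format_examples(examples: List[Example]) -> str:
    """Render examples as alternating Input/Output lines (hypothetical format)."""
    return "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)


def instruction_following_prompt(instruction: str, query: int) -> str:
    """Deductive setting: the task is described explicitly, with no examples."""
    return f"{instruction}\nInput: {query}\nOutput:"


def few_shot_prompt(examples: List[Example], query: int) -> str:
    """Inductive setting: the task is specified only through examples."""
    return f"{format_examples(examples)}\nInput: {query}\nOutput:"


def instruction_inference_prompt(examples: List[Example]) -> str:
    """Abductive setting, step 1: ask for a task description given examples."""
    return (
        f"{format_examples(examples)}\n"
        "Describe in one sentence the rule that maps each input to its output.\n"
        "Rule:"
    )


def prompt_with_inferred_instruction(
    inferred: str, examples: List[Example], query: int
) -> str:
    """Abductive setting, step 2: predict using the generated description."""
    return f"{inferred}\n{format_examples(examples)}\nInput: {query}\nOutput:"


if __name__ == "__main__":
    # Hypothetical task: y = 2x + 3, observed at x = 1 and x = 4.
    demo = make_examples(2, 3, [1, 4])
    print(instruction_following_prompt("Apply the function y = 2x + 3.", 7))
    print("---")
    print(few_shot_prompt(demo, 7))
    print("---")
    print(instruction_inference_prompt(demo))
```

In this sketch, the description returned by a model for the third prompt would be passed to `prompt_with_inferred_instruction` to make the final prediction. Comparing accuracy across the resulting prompts is one way to probe whether the capabilities dissociate.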
