Experimental Contexts Can Facilitate Robust Semantic Property Inference in Language Models, but Inconsistently

Kanishka Misra, Allyson Ettinger, Kyle Mahowald

TL;DR

A case-study on the extent to which experimental contexts can improve LMs’ robustness in performing property inheritance—predicting semantic properties of novel concepts, a task that they have been previously shown to fail on.

Abstract

Recent zero-shot evaluations have highlighted important limitations in the abilities of language models (LMs) to perform meaning extraction. However, it is now well known that LMs can demonstrate radical improvements in the presence of experimental contexts such as in-context examples and instructions. How well does this translate to previously studied meaning-sensitive tasks? We present a case-study on the extent to which experimental contexts can improve LMs' robustness in performing property inheritance -- predicting semantic properties of novel concepts, a task that they have been previously shown to fail on. Upon carefully controlling the nature of the in-context examples and the instructions, our work reveals that they can indeed lead to non-trivial property inheritance behavior in LMs. However, this ability is inconsistent: with a minimal reformulation of the task, some LMs were found to pick up on shallow, non-semantic heuristics from their inputs, suggesting that the computational principles of semantic property inference are yet to be mastered by LMs.

Paper Structure

This paper contains 25 sections, 3 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: LMs are prompted with in-context examples that are compatible with both robust property inheritance and position-based heuristics. At test time, we evaluate on cases where the heuristics support the desired behavior and on cases where they do not. We use stimuli from COMPS and its reformulation as a QA task.
  • Figure 2: Overall results from our experiments testing non-instruction-tuned LMs on COMPS and COMPS-QA using in-context examples, with and without instructions. Results are aggregated across both heuristics: first-correct and recent-correct. Error bars are over different sets of in-context examples. Most models start off near chance in the 0-shot case, but many improve as more examples are given. The solid green line depicts each model's base property knowledge performance, while the black dashed line depicts chance performance.
  • Figure 3: Overall results on the four instruction-tuned models considered. Results are aggregated across both heuristics: first-correct and recent-correct. Error bars are over different sets of in-context examples. The solid green line depicts each model's base property knowledge performance, while the black dashed line depicts chance performance.
  • Figure 4: Finer-grained results on the instruction-tuned OLMo-7B LM, demonstrating its preference for selecting the first concept regardless of the heuristics.
  • Figure 5: Fine-grained results for non-instruction-tuned LMs on COMPS as a function of the number of in-context examples (with and without instructions).
  • ...and 3 more figures