AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies

Xiao Ye, Andrew Wang, Jacob Choi, Yining Lu, Shreya Sharma, Lingfeng Shen, Vijay Tiyyala, Nicholas Andrews, Daniel Khashabi

TL;DR

AnaloBench assesses whether state-of-the-art language models can identify abstract and long-context analogies. It introduces two tasks, T1 and T2, built on 340 human-written analogies organized into 47 clusters and elaborated into 10- and 30-sentence stories. The study finds that while larger models improve on short analogies, gains largely plateau when stories are longer or when the analogy must be picked from a large candidate pool; GPT-4 and Claude-v2 do not match human performance on long-context tasks. The work highlights fundamental challenges in LM analogical reasoning and provides a data release to spur future research.

Abstract

Humans regularly engage in analogical thinking, relating personal experiences to current situations (X is analogous to Y because of Z). Analogical thinking allows humans to solve problems in creative ways, grasp difficult concepts, and articulate ideas more effectively. Can language models (LMs) do the same? To answer this question, we propose AnaloBench, a benchmark to determine analogical reasoning ability in LMs. Our benchmarking approach focuses on aspects of this ability that are common among humans: (i) recalling related experiences from a large amount of information, and (ii) applying analogical reasoning to complex and lengthy scenarios. We test a broad collection of proprietary models (e.g., GPT family, Claude V2) and open-source models such as LLaMA2. As in prior results, scaling up LMs results in some performance boosts. Surprisingly, scale offers minimal gains when (i) analogies involve lengthy scenarios, or (ii) relevant scenarios must be recalled from a large pool of information, a process analogous to finding a needle in a haystack. We hope these observations encourage further research in this field.

Paper Structure (52 sections, 1 equation, 7 figures, 9 tables)

Figures (7)

  • Figure 1: The problem setup: given a story, the goal is to identify an analogous story from a story bank. We study the difficulty of this goal for LMs by varying the following parameters: (i) length of stories, (ii) number of stories in the story bank. In the example, both "Maria" and "the oak" lose the ability to provide for others. While the strength of analogies can vary, we design our benchmark to account for this variation.
  • Figure 2: Overview of AnaloBench, covering both story expansion and task creation (§\ref{sec:task-descriptions}). Our abstract analogy identification benchmark features two tasks: ($T_1$) identifying analogies from a mini story bank and ($T_2$) identifying analogies from a large story bank. Each task is repeated at varying story lengths ($\sim$ 1, 10, and 30 sentences), with LLMs extending each story to the target length. We find that while analogical reasoning shows signs of emergence, reasoning over longer and more complex analogies remains a challenge for state-of-the-art LMs.
  • Figure 3: An overview of dataset creation (§\ref{subsec:creation}). Left: Human annotators are asked to create pairs of analogous sentences. Sentences can be repeated from analogy to analogy. Right: Pairs that share a sentence can be grouped into a cluster of mutually analogous sentences by transitivity.
  • Figure 4: Accuracy of LMs on $T_1$ (§\ref{sec:results}).
  • Figure 5: Precision-recall plot (in percentage) of LMs on $T_2$ (§\ref{subsec:analogy-results}) at three different story lengths (1, 10, 30 sentences). With increasing story length, the precision-recall of the models approaches random.
  • ...and 2 more figures
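
The clustering step described in the Figure 3 caption (pairs that share a sentence are merged into one cluster by transitivity) can be illustrated with a small union-find sketch. This is only an illustration under assumptions, not the authors' pipeline: the function name `cluster_pairs` and the sentence identifiers are hypothetical, and analogy is treated here as a strictly transitive relation.

```python
from collections import defaultdict


def cluster_pairs(pairs):
    """Group items into clusters, treating each pair as a transitive link.

    `pairs` is an iterable of (a, b) tuples of hashable sentence identifiers
    (hypothetical names; the paper does not specify a data format).
    """
    parent = {}

    def find(x):
        # Register unseen items as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Merge the two clusters that each annotated pair connects.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect every item under its cluster root.
    clusters = defaultdict(set)
    for item in list(parent):
        clusters[find(item)].add(item)
    return list(clusters.values())


if __name__ == "__main__":
    # s1~s2 and s2~s3 share s2, so s1, s2, s3 end up in one cluster.
    example = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]
    for cluster in cluster_pairs(example):
        print(sorted(cluster))
```

Union-find is one convenient choice here because each new pair is merged in near-constant time; computing connected components of the pair graph with a breadth-first search would give the same clusters.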