ParallelPARC: A Scalable Pipeline for Generating Natural-Language Analogies

Oren Sultan, Yonatan Bitton, Ron Yosef, Dafna Shahaf

TL;DR

ParallelPARC is a scalable pipeline that uses state-of-the-art LLMs to generate complex, paragraph-length analogies and challenging distractors, addressing the scarcity of such data in computational analogy research. By coupling automatic candidate generation with targeted human annotation and a GPT-4 auto-labeler, the authors produce gold- and silver-sets for ProPara-Logy, a large benchmark of paragraph-based analogies about scientific processes. In evaluation, humans outperform the best models by roughly 13% after light supervision, and the automatically generated silver-set proves useful for training smaller models; the challenging distractors confuse LLMs but not humans. The work demonstrates a practical, domain-adaptable pipeline with implications for education, creativity, and AI generalization across domains.

Abstract

Analogy-making is central to human cognition, allowing us to adapt to novel situations -- an ability that current AI systems still lack. Most analogy datasets today focus on simple analogies (e.g., word analogies); datasets including complex types of analogies are typically manually curated and very small. We believe that this holds back progress in computational analogy. In this work, we design a data generation pipeline, ParallelPARC (Parallel Paragraph Creator), leveraging state-of-the-art Large Language Models (LLMs) to create complex, paragraph-based analogies, as well as distractors, both simple and challenging. We demonstrate our pipeline and create ProPara-Logy, a dataset of analogies between scientific processes. We publish a gold-set, validated by humans, and a silver-set, generated automatically. We test LLMs' and humans' analogy recognition in binary and multiple-choice settings, and find that humans outperform the best models (~13% gap) after light supervision. We demonstrate that our silver-set is useful for training models. Lastly, we show that challenging distractors confuse LLMs, but not humans. We hope our pipeline will encourage research in this emerging field.

Paper Structure (37 sections, 22 figures, 3 tables)

Figures (22)

  • Figure 1: Our data generation pipeline. We generate analogy candidates, then collect human annotations on a random sample to be used as few-shot for an auto-labeling model. We run the model to label candidates at scale. We randomly split the data into silver-set and gold-set, which is validated by humans. In addition to positives (analogies), we include random target paragraphs (simple negatives), and generate distractors (challenging negatives).
  • Figure 2: An example of an analogous sample from our dataset (generated by our pipeline). Two scientific processes, base and target, are described via a title, a domain, and a paragraph of natural-language text. A sample also includes similar relations, hinting at why the processes are analogous.
  • Figure 3: An example of the distractor creation process. On the left is the Base paragraph (about bats using echolocation). In the middle is a Target paragraph, which is analogous to the Base paragraph. On the right is a Target (Distractor) paragraph, generated from the middle paragraph by switching the order of events. In the Target, sound waves are emitted and then received as an echo, and submarines interpret the received echo. In the Target (Distractor), this order is reversed, altering the cause-and-effect relations of the true analogy.
  • Figure 4: A one-shot prompt for finding a target analogous subject and generating the similar relations between base and target.
  • Figure 5: A one-shot prompt for writing a target paragraph given the subject and the relations in target.
  • ...and 17 more figures
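
The sample layout in Figure 2, the event-order distractor in Figure 3, and the multiple-choice setting mentioned in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names Process, Sample, make_distractor and multiple_choice_item are hypothetical, the example sentences are written for this sketch rather than taken from ProPara-Logy, and the plain swap of two events is a toy stand-in for the LLM-based distractor generation the pipeline uses.

```python
import random
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Process:
    """A scientific process: title, domain, and an ordered list of events."""
    title: str
    domain: str
    events: Tuple[str, ...]  # assumption: one sentence per event

    @property
    def paragraph(self) -> str:
        # The paragraph is simply the events in order.
        return " ".join(self.events)


@dataclass(frozen=True)
class Sample:
    """A base/target pair with the relations that hint at the analogy."""
    base: Process
    target: Process
    similar_relations: Tuple[Tuple[str, str], ...]  # (base relation, target relation)
    label: int  # 1 = analogy, 0 = negative


def make_distractor(target: Process, i: int, j: int) -> Process:
    """Toy stand-in: swap events i and j, keeping title and domain unchanged."""
    events = list(target.events)
    events[i], events[j] = events[j], events[i]
    return replace(target, events=tuple(events))


def multiple_choice_item(
    target: Process, negatives: List[Process], rng: random.Random
) -> Tuple[List[Process], int]:
    """Shuffle the true target among negatives; return options and answer index."""
    options = [target, *negatives]
    rng.shuffle(options)
    return options, options.index(target)


# Illustrative data only (hypothetical sentences, not from the dataset).
base = Process(
    title="How bats use echolocation",
    domain="Biology",
    events=(
        "The bat emits sound waves.",
        "The sound waves return to the bat as an echo.",
        "The bat interprets the echo to locate objects.",
    ),
)
target = Process(
    title="How submarines use sonar",
    domain="Engineering",
    events=(
        "The submarine emits sound waves.",
        "The sound waves return to the submarine as an echo.",
        "The submarine interprets the echo to locate objects.",
    ),
)
positive = Sample(
    base=base,
    target=target,
    similar_relations=(("bat emits sound waves", "submarine emits sound waves"),),
    label=1,
)

# Challenging negative: same vocabulary, but the emit/receive order is reversed.
distractor = make_distractor(target, 0, 1)
negative = Sample(base=base, target=distractor, similar_relations=(), label=0)

options, answer = multiple_choice_item(target, [distractor], random.Random(0))
```

In the binary setting a model would see one (base, candidate) pair and predict the label; in the multiple-choice setting it would see the base and all options and pick the index of the analogous target. Because the distractor reuses the target's sentences, telling it apart from the true target requires attending to event order rather than surface overlap.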