Analogy Generation by Prompting Large Language Models: A Case Study of InstructGPT
Bhavya Bhavya, Jinjun Xiong, Chengxiang Zhai
TL;DR
This paper investigates analogy generation by prompting large language models for two tasks: Analogous Concept Generation (ACG) and Analogy Explanation Generation (AEG). It shows that InstructGPT can produce meaningful analogies when prompts are precise and temperature is low, with performance strongly influenced by model size. The authors introduce new datasets and a thorough robustness analysis across prompts, temperatures, and spelling perturbations, plus a human evaluation that reveals larger models approach or reach human performance for ACG but still struggle with AEG. The work demonstrates the promise and limitations of prompt-driven analogy generation and outlines future directions for improving cross-domain generalization and evaluative methods.
Abstract
We propose a novel application of prompting Pre-trained Language Models (PLMs) to generate analogies and study how to design effective prompts for two task settings: generating a source concept analogous to a given target concept (aka Analogous Concept Generation or ACG), and generating an explanation of the similarity between a given pair of target concept and source concept (aka Analogous Explanation Generation or AEG). We found that it is feasible to prompt InstructGPT to generate meaningful analogies and the best prompts tend to be precise imperative statements especially with a low temperature setting. We also systematically analyzed the sensitivity of the InstructGPT model to prompt design, temperature, and injected spelling errors, and found that the model is particularly sensitive to certain variations (e.g., questions vs. imperative statements). Further, we conducted human evaluation on 1.4k of the generated analogies and found that the quality of generations varies substantially by model size. The largest InstructGPT model can achieve human-level performance at generating meaningful analogies for a given target while there is still room for improvement on the AEG task.
