When Models Know More Than They Say: Probing Analogical Reasoning in LLMs

Hope McGovern, Caroline Craig, Thomas Lippincott, Hale Sirin

Abstract

Analogical reasoning is a core cognitive faculty essential for narrative understanding. While LLMs perform well when surface and structural cues align, they struggle in cases where an analogy is not apparent on the surface but requires latent information, suggesting limitations in abstraction and generalisation. In this paper we compare a model's probed representations with its prompted performance at detecting narrative analogies, revealing an asymmetry: for rhetorical analogies, probing significantly outperforms prompting in open-source models, while for narrative analogies, they achieve a similar (low) performance. This suggests that the relationship between internal representations and prompted behavior is task-dependent and may reflect limitations in how prompting accesses available information.

Paper Structure

This paper contains 46 sections, 9 equations, 12 figures, and 7 tables.

Figures (12)

  • Figure 1: Analogical reasoning is a higher-order capability that requires a combination of lower-level tasks such as entity detection or coreference resolution.
  • Figure 2: MAP for narrative (left) and rhetorical (right) parallelism tasks across different classifier architectures and model variants on Llama-3.2-1B. Bars show MAP scores (mean ± standard deviation) for three classifier types: cosine similarity (Cosine), logistic regression (Logreg), and multi-layer perceptron (MLP), with separate bars for base and instruction-tuned (Instruct) model variants.
  • Figure 3: Individual layer performance vs. all-layers configuration on Llama-3.2-1B base model with MLP classifiers. Top panels show learned layer weights from ScalarMix (averaged across cross-validation folds), indicating the relative contribution of each layer when all layers are combined. Bottom panels show mean average precision (MAP) for each individual layer (bars with error bars showing standard deviation) and the all-layers performance (red dashed horizontal line).
  • Figure 4: MAP on prompted ranking across different models
  • Figure 5: Distribution of branch sizes based on the number of spans per parallel set
  • ...and 7 more figures