Characterizing the Role of Similarity in the Property Inferences of Language Models

Juan Diego Rodriguez, Aaron Mueller, Kanishka Misra

TL;DR

The paper investigates whether taxonomy or similarity drives property inheritance in language models, and whether the two factors interact rather than act independently. It pairs behavioral experiments, which use nonce properties and two similarity metrics (Word-Sense and SPoSE), with a causal interpretability method (Distributed Alignment Search, DAS) across four instruction-tuned LMs. The findings show that taxonomy and similarity jointly influence property projection: similarity signals are encoded in causal subspaces that are entangled with taxonomic information, and SPoSE similarity aligns more closely with LM judgments than Word-Sense similarity does. These results challenge purely taxonomic views of LM reasoning, highlight human-like content effects in conceptual representations, and suggest new directions for psycholinguistic experiments and interpretability methods that probe inductive generalization in neural networks.

Abstract

Property inheritance -- a phenomenon where novel properties are projected from higher level categories (e.g., birds) to lower level ones (e.g., sparrows) -- provides a unique window into how humans organize and deploy conceptual knowledge. It is debated whether this ability arises due to explicitly stored taxonomic knowledge vs. simple computations of similarity between mental representations. How are these mechanistic hypotheses manifested in contemporary language models? In this work, we investigate how LMs perform property inheritance with behavioral and causal representational analysis experiments. We find that taxonomy and categorical similarities are not mutually exclusive in LMs' property inheritance behavior. That is, LMs are more likely to project novel properties from one category to the other when they are taxonomically related and at the same time, highly similar. Our findings provide insight into the conceptual structure of language models and may suggest new psycholinguistic experiments for human subjects.

Paper Structure

This paper contains 35 sections, 3 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Property inheritance involves projecting properties from a category to its members. For example, if dogs are daxable, are corgis daxable? Here, daxable is a nonce word used to study property inheritance without any confounding effects from parametric knowledge. Language models may rely on taxonomic (left) and/or similarity (right) relations to perform property inheritance. We investigate the interplay between these two effects in LMs' property inheritance judgments using both behavioral and mechanistic analyses.
  • Figure 2: LMs' Average Relative Probability of 'Yes' for different conclusion categories and different types of similarities (Word-Sense vs SPoSE). LMs show a clear sensitivity to taxonomic relations, but also show an effect of similarity, where they are more likely to extend the property to a conclusion category when the premise and the conclusion categories are highly similar. Chance behavior is 0.50, as indicated by the dashed line.
  • Figure 3: We hypothesize that language models rely on this causal graph to perform the categorical inference task. We are interested specifically in the node responsible for taxonomic judgments, so we use DAS to isolate the subspace encoding this causal variable. We causally verify its sensitivity to taxonomy by setting its value to what it would have been on counterfactual source inputs, and observing whether model behavior changes appropriately. Here, activations at the isolated subspace for the base are replaced with activations from the source. Interchange Intervention Accuracy (IIA) measures the extent to which this intervention produces the behavior expected under the hypothesized causal graph across inputs.
  • Figure 4: Interchange Intervention Accuracies (IIA) for the causal graph in Figure 3 for Gemma-2-9B-IT (Top) and Mistral-7B-Instruct-0.2 (Bottom), when intervening at various layers and token positions, with negative samples derived using SPoSE similarities. Since the premise and conclusion nouns are often multiple tokens, we show IIA when intervening at the first and last positions of each. Note: the two models have different numbers of layers.
  • Figure 5: DAS interventions on pairs where the direction of the inference is flipped. Low IIA values in this experiment indicate that the learned subspaces for property inheritance are sensitive to direction.
  • ...and 2 more figures
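
The two quantities named in the captions above, the relative probability of 'Yes' (Figure 2) and Interchange Intervention Accuracy (Figures 3 to 5), can be illustrated with a small sketch. This is not the authors' code. It assumes that the relative probability renormalizes P('Yes') over the two answer options only, which is consistent with the stated chance level of 0.50. It also assumes the intervened subspace is already given as an orthonormal basis, whereas DAS learns that subspace by optimization. All function names here are hypothetical.

```python
from typing import List, Sequence


def relative_yes_probability(p_yes: float, p_no: float) -> float:
    """Renormalize P('Yes') over the two answer options (assumed definition)."""
    total = p_yes + p_no
    if total <= 0:
        raise ValueError("p_yes + p_no must be positive")
    return p_yes / total


def _dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def interchange_intervention(
    base: Sequence[float],
    source: Sequence[float],
    basis: Sequence[Sequence[float]],
) -> List[float]:
    """Swap the base activation's component inside span(basis) for the source's.

    `basis` is assumed to hold orthonormal direction vectors; the part of
    `base` orthogonal to the subspace is left untouched.
    """
    out = list(base)
    for direction in basis:
        # Difference between source and base coordinates along this direction.
        delta = _dot(direction, source) - _dot(direction, base)
        out = [o + delta * d for o, d in zip(out, direction)]
    return out


def interchange_intervention_accuracy(
    predicted: Sequence[str], expected: Sequence[str]
) -> float:
    """Fraction of intervened outputs matching the causal graph's prediction."""
    if not predicted or len(predicted) != len(expected):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == e for p, e in zip(predicted, expected)) / len(predicted)
```

In a real experiment, `base` and `source` would be hidden states taken at a chosen layer and token position, and `predicted` would be the model's answers after the patched activation is propagated through the remaining layers.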