Benchmarking and Understanding Compositional Relational Reasoning of LLMs

Ruikang Ni; Da Xiao; Qingye Meng; Xiangyu Li; Shihui Zheng; Hongliang Liang

TL;DR

This work studies compositional relational reasoning (CRR) in transformer LLMs. It introduces Generalized Associative Recall (GAR), a synthetic benchmark that unifies associative recall and knowledge recall tasks through relational loops with controllable difficulty, which makes systematic mechanistic interpretability (MI) analysis feasible. Using attribution patching and targeted interventions on Vicuna-33B, it identifies core circuits and a set of attention heads, most notably True/False heads, that encode truth judgments and drive CRR across tasks; these heads recur across model sizes and transfer to other truth-detection datasets such as SNLI and GoT. The findings expose a persistent compositionality gap that scales with model size and offer actionable insights for diagnosing and improving CRR in LLMs. The dataset and code are available at https://github.com/Caiyun-AI/GAR.

Abstract

Compositional relational reasoning (CRR) is a hallmark of human intelligence, but we lack a clear understanding of whether and how existing transformer large language models (LLMs) can solve CRR tasks. To enable systematic exploration of the CRR capability of LLMs, we first propose a new synthetic benchmark called Generalized Associative Recall (GAR), which integrates and generalizes the essence of several tasks from mechanistic interpretability (MI) studies in a unified framework. Evaluation shows that GAR is challenging enough for existing LLMs to reveal their fundamental deficiency in CRR, yet simple enough for systematic MI study. Then, to understand how LLMs solve GAR tasks, we use attribution patching to discover the core circuits reused by Vicuna-33B across different tasks, along with a set of vital attention heads. Intervention experiments show that the correct functioning of these heads significantly impacts task performance. In particular, we identify two classes of heads whose activations represent the abstract notions of true and false in GAR tasks, respectively. They play a fundamental role in CRR across various models and tasks. The dataset and code are available at https://github.com/Caiyun-AI/GAR.
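
The abstract names attribution patching without describing it. As a rough illustration only, and not the authors' implementation, the sketch below shows the general idea on a toy problem: instead of re-running the model once per patched component (activation patching), the effect of swapping in a clean activation is estimated to first order as (clean activation - corrupted activation) times the gradient of the metric at the corrupted activation. The `metric` function, the two-element activation vectors, and the finite-difference gradient are all hypothetical stand-ins; a real study would use model activations and autograd.

```python
import math

def metric(acts):
    # Hypothetical downstream readout: linear in component 0,
    # saturating (tanh) in component 1.
    return 2.0 * acts[0] + math.tanh(acts[1])

def grad(f, acts, eps=1e-5):
    # Central finite-difference gradient; stands in for autograd in this toy.
    g = []
    for i in range(len(acts)):
        up = list(acts)
        up[i] += eps
        dn = list(acts)
        dn[i] -= eps
        g.append((f(up) - f(dn)) / (2 * eps))
    return g

def attribution_patching(f, clean, corrupt):
    # First-order estimate: one gradient evaluation covers every component.
    g = grad(f, corrupt)
    return [(c - k) * gi for c, k, gi in zip(clean, corrupt, g)]

def activation_patching(f, clean, corrupt):
    # Reference: patch one component at a time and re-evaluate the metric.
    base = f(corrupt)
    effects = []
    for i in range(len(corrupt)):
        patched = list(corrupt)
        patched[i] = clean[i]
        effects.append(f(patched) - base)
    return effects

clean = [1.0, 0.3]     # hypothetical activations on the clean input
corrupt = [0.0, 0.1]   # hypothetical activations on the corrupted input

approx = attribution_patching(metric, clean, corrupt)
exact = activation_patching(metric, clean, corrupt)
for i, (a, e) in enumerate(zip(approx, exact)):
    print(f"component {i}: estimate={a:.4f} patched={e:.4f}")
```

In this toy, the estimate matches exact patching for the linear component and is only approximate for the nonlinear one, which is the usual trade-off: the approximation is cheap enough to score every head in a large model, at the cost of some error where the metric is strongly nonlinear in the activation.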
