Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning
Jun Zhao, Jingqi Tong, Yurong Mou, Ming Zhang, Qi Zhang, Xuanjing Huang
TL;DR
The paper examines whether large language models can generalize compositionally in mathematical reasoning. It introduces MathTrap, a dataset that injects logical traps into standard math problems, creating unseen scenarios that require combining core mathematical knowledge with trap-specific concepts. It defines compositionality within Hilbert's formal deductive framework and evaluates models on Original, Conceptual, and Trap problems, using GPT-4 and Claude-3.5-Sonnet as judges and testing several interventions (natural language prompts, few-shot demonstrations, and fine-tuning). Results show that LLMs hold the necessary knowledge but seldom compose it spontaneously to solve trap problems, whereas humans perform substantially better. External interventions and slow-thinking models such as o1 narrow this gap without closing it, so robust compositional generalization remains an open challenge. The work provides a data-driven framework for stress-testing compositional reasoning in math and informs future directions for alignment-aware prompting, data augmentation, and targeted fine-tuning.
Abstract
Human cognition exhibits systematic compositionality, the algebraic ability to generate infinite novel combinations from finite learned components, which is the key to understanding and reasoning about complex logic. In this work, we investigate the compositionality of large language models (LLMs) in mathematical reasoning. Specifically, we construct a new dataset MathTrap by introducing carefully designed logical traps into the problem descriptions of MATH and GSM8K. Since problems with logical flaws are quite rare in the real world, these represent "unseen" cases to LLMs. Solving these requires the models to systematically compose (1) the mathematical knowledge involved in the original problems with (2) knowledge related to the introduced traps. Our experiments show that while LLMs possess both components of requisite knowledge, they do not spontaneously combine them to handle these novel cases. We explore several methods to mitigate this deficiency, such as natural language prompts, few-shot demonstrations, and fine-tuning. Additionally, we test the recently released OpenAI o1 model and find that human-like "slow thinking" helps improve the compositionality of LLMs. Overall, systematic compositionality remains an open challenge for large language models.
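As a rough illustration of how the comparison between problem types might be tallied, the sketch below computes per-type accuracy from judged answers. The record layout, the type labels, the `trap_to_original_ratio` name, and the toy values are all assumptions made for illustration; they are not data or code from the paper.

```python
from collections import defaultdict

# Hypothetical judged answers as (problem_type, is_correct) pairs.
# These are toy placeholder values, NOT results reported in the paper.
judged = [
    ("original", True), ("original", True),
    ("conceptual", True), ("conceptual", False),
    ("trap", False), ("trap", True), ("trap", False), ("trap", False),
]


def accuracy_by_type(records):
    """Return the fraction of correct answers for each problem type."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for problem_type, is_correct in records:
        totals[problem_type] += 1
        correct[problem_type] += int(is_correct)
    return {t: correct[t] / totals[t] for t in totals}


acc = accuracy_by_type(judged)

# Illustrative summary (hypothetical name): how much of the accuracy on
# original problems survives once a trap is added.
trap_to_original_ratio = acc["trap"] / acc["original"]
```

In this framing, a model that holds both knowledge components but fails to compose them would score well on the original and conceptual problems while scoring poorly on the trap problems.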
