A Matter of Interest: Understanding Interestingness of Math Problems in Humans and Language Models
Shubhra Mishra, Yuka Machino, Gabriel Poesia, Albert Jiang, Joy Hsu, Adrian Weller, Challenger Mishra, David Broman, Joshua B. Tenenbaum, Mateja Jamnik, Cedegao E. Zhang, Katherine M. Collins
TL;DR
This work investigates whether AI systems can reliably judge how interesting mathematical problems are, as distinct from how difficult they are, and compares these judgments with those of two human groups: crowdsourced Prolific participants and IMO competitors. Evaluating 12 models from 5 families, the study finds that while many LLMs broadly align with human judgments, they do not capture the full distribution of human opinions and only weakly reflect why humans find problems interesting. The findings highlight both the potential of AI as a mathematical thinking partner and its current limitations in capturing subjective, human-centered notions of interestingness, especially regarding problem elegance and rationale. These insights inform the design of AI systems for education and automated mathematical discovery, emphasizing the need to account for variability in human judgments and to align model explanations with human reasoning.
Abstract
The evolution of mathematics has been guided in part by interestingness. From researchers choosing which problems to tackle next, to students deciding which ones to engage with, people's choices are often guided by judgments about how interesting or challenging problems are likely to be. As AI systems, such as LLMs, increasingly participate in mathematics with people, whether for advanced research or education, it becomes important to understand how well their judgments align with human ones. Our work examines this alignment through two empirical studies of human and LLM assessment of mathematical interestingness and difficulty, spanning a range of mathematical experience. We study two groups: participants from a crowdsourcing platform and International Mathematical Olympiad competitors. We show that while many LLMs appear to broadly agree with human notions of interestingness, they mostly do not capture the distribution observed in human judgments. Moreover, most LLMs only somewhat align with why humans find certain math problems interesting, showing weak correlation with human-selected interestingness rationales. Together, our findings highlight both the promise and limitations of current LLMs in capturing human interestingness judgments for mathematical AI thought partnerships.
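The distinction between broad agreement and matching the distribution of human judgments can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the ratings are invented, the 1-5 scale is hypothetical, and the two measures (Spearman rank correlation of per-problem means, and a 1-D Wasserstein distance between per-problem rating samples) are common choices that may differ from the metrics actually used in the studies.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def wasserstein_1d(a, b):
    """1-D Wasserstein-1 distance between two equal-size samples."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-size samples")
    return mean(abs(p - q) for p, q in zip(sorted(a), sorted(b)))

# Invented interestingness ratings per problem (hypothetical 1-5 scale).
human = {"p1": [1, 2, 4, 5], "p2": [2, 3, 4, 5], "p3": [3, 4, 5, 5]}
model = {"p1": [3, 3, 3, 3], "p2": [3, 4, 3, 4], "p3": [4, 4, 4, 5]}

problems = sorted(human)
human_means = [mean(human[p]) for p in problems]
model_means = [mean(model[p]) for p in problems]

# Agreement on which problems are more interesting on average.
rho = spearman(human_means, model_means)

# Mismatch between the spread of human and model ratings, per problem.
distance = {p: wasserstein_1d(human[p], model[p]) for p in problems}

print(f"Spearman correlation of mean ratings: {rho:.3f}")
for p in problems:
    print(f"{p}: Wasserstein distance = {distance[p]}")
```

In this toy data the model's mean ratings rank the problems in the same order as the human means, so the rank correlation is essentially perfect, yet the model's ratings are far less spread out than the human ones, which shows up as a nonzero distributional distance for every problem.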
