A Matter of Interest: Understanding Interestingness of Math Problems in Humans and Language Models

Shubhra Mishra, Yuka Machino, Gabriel Poesia, Albert Jiang, Joy Hsu, Adrian Weller, Challenger Mishra, David Broman, Joshua B. Tenenbaum, Mateja Jamnik, Cedegao E. Zhang, Katherine M. Collins

TL;DR

This work investigates whether AI systems can reliably judge the worth of mathematical problems, separating interestingness from difficulty, and compares these judgments to humans’ across two data sources: crowdsourced Prolific participants and IMO competitors. By evaluating 12 models from 5 families, the study reveals that while many LLMs show broad alignment with human judgments, they do not capture the full distribution of human opinions and poorly reflect why humans find problems interesting. The findings highlight both the potential of AI as a mathematical thinking partner and its current limitations in capturing subjective, human-centered notions of interestingness, especially regarding problem elegance and rationale. These insights inform the design of AI systems for education and automated mathematical discovery, emphasizing the need to account for variability in human judgments and to align model explanations with human reasoning.

Abstract

The evolution of mathematics has been guided in part by interestingness. From researchers choosing which problems to tackle next, to students deciding which ones to engage with, people's choices are often guided by judgments about how interesting or challenging problems are likely to be. As AI systems, such as LLMs, increasingly participate in mathematics with people -- whether for advanced research or education -- it becomes important to understand how well their judgments align with human ones. Our work examines this alignment through two empirical studies of human and LLM assessment of mathematical interestingness and difficulty, spanning a range of mathematical experience. We study two groups: participants from a crowdsourcing platform and International Math Olympiad competitors. We show that while many LLMs appear to broadly agree with human notions of interestingness, they mostly do not capture the distribution observed in human judgments. Moreover, most LLMs only somewhat align with why humans find certain math problems interesting, showing weak correlation with human-selected interestingness rationales. Together, our findings highlight both the promises and limitations of current LLMs in capturing human interestingness judgments for mathematical AI thought partnerships.

Paper Structure

This paper contains 31 sections, 30 figures, 7 tables.

Figures (30)

  • Figure 1: Human interactions with mathematics involve not just solving problems, but deciding which problems are worth pursuing at all. LLMs, in many of their applications today, are asked to solve problems directly, without "choice" in what to solve. Our work compares models' and humans' judgments of problems, both in the final judgment and in the factors considered in reaching that judgment.
  • Figure 2: Agreement in humans' and LLMs' judgments about problem interestingness (left) and difficulty (right) on the Prolific dataset. Each cell shows the squared Pearson correlation ($R^2$ scaled by 100) between the row and column reasoners' per-problem mean ratings. The left matrix is sorted in descending order by agreement, while the right matrix follows the order of the left matrix to enable easier comparison. Darker cells indicate higher agreement ($R^2$). Models were sampled at temperature $1.0$; additional analyses, e.g., for temperature $0.3$, are included in Appendix \ref{app:model_correlation_prolific}.
  • Figure 3: Judgment speed distributions across LRMs on low- vs. high-interest Prolific problems. A low-interest problem for a model is one to which the model assigns a below-median interestingness score; a high-interest problem is one to which it assigns an above-median score. For each subset of problems, we label each judgment as slow, medium, or fast, based on the distribution of reasoning token counts for that model. Slow judgments occupy the bottom quartile and fast judgments occupy the top one, with medium-speed judgments covering the middle. We see that LRMs tend to engage in longer reasoning chains for problems that they ultimately label as being higher interest.
  • Figure 4: Agent–agent agreement on the Prolific dataset at temperature $0.30$. Each cell shows the squared Pearson correlation ($R^2$) between the row and column agents' per-problem mean ratings. Darker cells indicate higher agreement; the diagonal is 1.00 by definition. The Human row/column gives model–human agreement. Top: interestingness ratings; bottom: difficulty ratings.
  • Figure 5: Agent–agent agreement on the Prolific dataset at temperature $1.0$. Each cell shows the squared Pearson correlation ($R^2$) between the row and column agents' per-problem mean ratings. Darker cells indicate higher agreement. The Human row/column gives model–human agreement. (Top) interestingness ratings; (bottom) difficulty ratings.
  • ...and 25 more figures
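
The agreement matrices in Figures 2, 4, and 5 report the squared Pearson correlation between two reasoners' per-problem mean ratings. The following is a minimal sketch of how one such cell could be computed, not the authors' implementation: the function names, the dictionary layout, and the example ratings are all hypothetical, and restricting the computation to problems rated by both reasoners is an assumption.

```python
from math import fsum
from statistics import mean


def per_problem_means(ratings):
    """Collapse {problem_id: [rating, ...]} into {problem_id: mean rating}."""
    return {pid: mean(vals) for pid, vals in ratings.items()}


def r_squared_x100(a, b):
    """Squared Pearson correlation, scaled by 100, between two reasoners.

    `a` and `b` map problem ids to per-problem mean ratings. Only problems
    present in both mappings are used (an assumption of this sketch).
    """
    shared = sorted(set(a) & set(b))
    if len(shared) < 2:
        raise ValueError("need at least two shared problems")
    xs = [a[p] for p in shared]
    ys = [b[p] for p in shared]
    mx, my = mean(xs), mean(ys)
    sxy = fsum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = fsum((x - mx) ** 2 for x in xs)
    syy = fsum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return 100.0 * (sxy * sxy) / (sxx * syy)


# Made-up ratings for illustration only (not data from the paper).
human = per_problem_means({"p1": [1, 2], "p2": [3, 3], "p3": [5, 4]})
model = per_problem_means({"p1": [2, 2], "p2": [4, 4], "p3": [6, 6]})
cell = r_squared_x100(human, model)
```

Because the correlation is squared, a cell cannot distinguish a strong positive relationship from a strong negative one, so a high value indicates a strong linear relationship between the two sets of mean ratings rather than identical ratings.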