Scalable Text-Embedding-informed Cognitive Diagnosis of Large Language Models

Jia Liu, Zhiyu Xu, Yuqi Gu

Abstract

Large language models (LLMs) have achieved remarkable performance on diverse benchmarks, yet existing evaluation practices largely rely on coarse summary metrics that obscure underlying reasoning abilities. In this work, we propose novel methodologies to adapt cognitive diagnosis models (CDMs) in psychometrics to LLM evaluation, enabling fine-grained diagnosis via multidimensional discrete capability profiles and interpretable characterizations of LLM strengths and weaknesses. First, to enable CDM-based evaluation at benchmark scale (more than 1000 items), we propose a scalable method that jointly estimates LLM mastery profiles and the item-attribute Q-matrix, addressing key challenges posed by high-dimensional latent attributes (K > 20), large item pools, and the prohibitive computational cost of existing marginal maximum likelihood-based estimation. Second, we incorporate item-level textual information to construct AI-embedding-informed priors for the Q-matrix, stabilizing high-dimensional estimation while reducing reliance on costly human specification. We develop an efficient stochastic-approximation algorithm to jointly estimate LLM mastery profiles and the Q-matrix that balances data fit with text-embedding-informed priors. Simulation studies demonstrate accurate parameter recovery. An application to the MATH Level 5 benchmark illustrates the practical utility of our method for LLM evaluation and uncovers useful insights into LLMs' fine-grained capabilities.

Paper Structure (17 sections, 1 theorem, 23 equations, 7 figures, 6 tables)

Key Result

Theorem 1

Consider the DINA model with $(\widehat{\mathbf Q},\widehat{\mathbf A})$ obtained from the proposed estimation procedure. Let denote the average positive response rate. As $N,J \to \infty$, suppose $\sqrt{J}=O\!\left(\sqrt{M}N^{1-u}\right)$ for some $u\in(0,1)$ and $K=o(MJ\log J)$. Under Assumptions 1–3, the following hold: (a) For $\gamma_J = \frac{(\log J)^{1+\varepsilon}}{\sqrt{J}} \sqrt{M\log(2^{K})}$ w

Figures (7)

  • Figure 1: UMAP visualization of question–solution embeddings from the MATH-L5 benchmark, colored by official question types.
  • Figure 2: Root mean squared errors (RMSEs) of $g$ as functions of $N$ under different simulation settings.
  • Figure 3: Root mean squared errors (RMSEs) of $c$ as functions of $N$ under different simulation settings.
  • Figure 4: First two UMAP dimensions of question–solution embeddings with 28 embedding-derived clusters.
  • Figure 5: Comparison of the refined $\mathbf{Q}$-matrix and reference $\mathbf{Q}$-matrix ($N=45$). Shaded cells indicate estimated attribute associations ($\widehat{q}_{ij}=1$), while dark blue borders denote entries recovered from the reference $\mathbf{Q}$-matrix ($q_{ij}^{\text{(R)}}=1$).
  • ...and 2 more figures

Theorems & Definitions (1)

  • Theorem 1: Consistency under the DINA model (Theorem 1 in gu2023joint)
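
For readers unfamiliar with the DINA model named in Theorem 1, the sketch below illustrates its response rule in Python. It assumes the standard DINA parameterization, and it assumes that the $g$ and $c$ appearing in Figures 2 and 3 are the item-level guessing probability and the correct-response probability for a model that masters every required attribute; this excerpt does not define those symbols. The function names and the toy values are hypothetical and are not taken from the paper's code or data.

```python
from typing import Sequence


def dina_ideal_response(alpha: Sequence[int], q: Sequence[int]) -> int:
    """Return 1 if the profile masters every attribute the item requires, else 0.

    alpha: binary mastery profile of one LLM (length K).
    q:     binary row of the Q-matrix for one item (length K).
    """
    if len(alpha) != len(q):
        raise ValueError("alpha and q must have the same length K")
    # For binary vectors, a >= r fails only when r = 1 and a = 0,
    # i.e. when a required attribute is not mastered.
    return int(all(a >= r for a, r in zip(alpha, q)))


def dina_prob_correct(
    alpha: Sequence[int], q: Sequence[int], g: float, c: float
) -> float:
    """P(correct response) under DINA: c if the ideal response is 1, else g."""
    return c if dina_ideal_response(alpha, q) else g


# Hypothetical toy item requiring attributes 1 and 3 out of K = 3.
q_item = [1, 0, 1]
strong = [1, 1, 1]    # masters all attributes
partial = [1, 1, 0]   # lacks attribute 3, which the item requires

p_strong = dina_prob_correct(strong, q_item, g=0.1, c=0.9)
p_partial = dina_prob_correct(partial, q_item, g=0.1, c=0.9)
print(p_strong, p_partial)  # prints: 0.9 0.1
```

The conjunctive rule is what makes the Q-matrix matter: changing a single entry of `q_item` can move a profile from the `c` branch to the `g` branch, which is why joint estimation of mastery profiles and the Q-matrix is a coupled problem.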