Scalable Text-Embedding-informed Cognitive Diagnosis of Large Language Models

Jia Liu; Zhiyu Xu; Yuqi Gu

Abstract

Large language models (LLMs) have achieved remarkable performance on diverse benchmarks, yet existing evaluation practices largely rely on coarse summary metrics that obscure underlying reasoning abilities. In this work, we propose novel methodologies to adapt cognitive diagnosis models (CDMs) from psychometrics to LLM evaluation, enabling fine-grained diagnosis via multidimensional discrete capability profiles and interpretable characterizations of LLM strengths and weaknesses. First, to enable CDM-based evaluation at benchmark scale (more than 1000 items), we propose a scalable method that jointly estimates LLM mastery profiles and the item-attribute Q-matrix, addressing key challenges posed by high-dimensional latent attributes (K > 20), large item pools, and the prohibitive computational cost of existing marginal maximum likelihood-based estimation. Second, we incorporate item-level textual information to construct text-embedding-informed priors for the Q-matrix, stabilizing high-dimensional estimation while reducing reliance on costly human specification. We develop an efficient stochastic-approximation algorithm that jointly estimates LLM mastery profiles and the Q-matrix while balancing data fit against these priors. Simulation studies demonstrate accurate parameter recovery. An application to the MATH Level 5 benchmark illustrates the practical utility of our method for LLM evaluation and uncovers useful insights into LLMs' fine-grained capabilities.

Paper Structure (17 sections, 1 theorem, 23 equations, 7 figures, 6 tables)

Introduction
CDM-Based Evaluation of LLMs
Evaluation Framework via Cognitive Diagnosis Modeling
Statistical Regimes of Large-Scale LLM Evaluation
Embedding-Informed Joint Estimation Framework
Embedding-Powered Construction and Prior Specification of the Reference $\mathbf{Q}$-Matrix
Embedding-Based Structural Discovery.
Reference $\mathbf{Q}$-Matrix as Structural Prior.
Scalable SAEM Algorithm with a Q-Matrix Prior
Consistency
Simulation Study
Real Data Application to LLM Evaluation
Embedding-Informed Prior and Model Fitting
Refinement of the Item-Attribute Q-Matrix: Agreement, Augmentation, and Reclassification
Attribute Mastery Profiles of LLMs: Strengths, Weaknesses, and Structural Gaps
...and 2 more sections

Key Result

Theorem 1

Consider the DINA model with $(\widehat{\mathbf Q},\widehat{\mathbf A})$ obtained from the proposed estimation procedure. Let denote the average positive response rate. As $N,J \to \infty$, suppose $\sqrt{J}=O\!(\sqrt{M}N^{1-u})$ for some $u\in(0,1)$ and $K=o(MJ\log J).$ Under Assumptions 1–3, the following hold: (a) For $\gamma_J = \frac{(\log J)^{1+\varepsilon}}{\sqrt{J}} \sqrt{M\log(2^{K})}$ w

Figures (7)

Figure 1: UMAP visualization of question–solution embeddings from the MATH-L5 benchmark, colored by official question types.
Figure 2: Root mean squared errors (RMSEs) of $g$ as functions of $N$ under different simulation settings.
Figure 3: Root mean squared errors (RMSEs) of $c$ as functions of $N$ under different simulation settings.
Figure 4: First two UMAP dimensions of question–solution embeddings with 28 embedding-derived clusters.
Figure 5: Comparison of the refined $\mathbf{Q}$-matrix and reference $\mathbf{Q}$-matrix ($N=45$). Shaded cells indicate estimated attribute associations ($\widehat{q}_{ij}=1$), while dark blue borders denote entries recovered from the reference $\mathbf{Q}$-matrix ($q_{ij}^{\text{(R)}}=1$).
...and 2 more figures

Theorems & Definitions (1)

Theorem 1: Consistency under the DINA model (Theorem 1 in gu2023joint)
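
Theorem 1 is stated for the DINA model. As a reminder of what that model assumes, the following minimal sketch computes the probability of a correct response under the standard DINA formulation: an item is answered correctly with probability $c$ when the respondent masters every attribute the item requires, and with probability $g$ otherwise. The symbols $c$ and $g$ are taken from the captions of Figures 2 and 3; reading them as the "mastery success" and "guessing" parameters is an assumption, since the model definition is not part of this excerpt. The function names `ideal_response` and `dina_prob` are hypothetical and chosen only for illustration.

```python
def ideal_response(alpha, q):
    """Return 1 if the mastery profile covers every attribute the item requires.

    alpha: sequence of 0/1 mastery indicators for one respondent (here, one LLM).
    q:     sequence of 0/1 indicators, one row of the Q-matrix for one item.
    """
    return int(all(a >= r for a, r in zip(alpha, q)))


def dina_prob(alpha, q, c, g):
    """Probability of a correct response under the standard DINA model.

    Assumption: c is the success probability when all required attributes
    are mastered, and g is the guessing probability otherwise (g < c).
    """
    return c if ideal_response(alpha, q) else g


# Item requires attributes 1 and 3 out of K = 3.
q_row = (1, 0, 1)
print(dina_prob((1, 1, 1), q_row, c=0.9, g=0.2))  # masters all required attributes
print(dina_prob((1, 1, 0), q_row, c=0.9, g=0.2))  # lacks attribute 3
```

The conjunctive rule in `ideal_response` is what makes the Q-matrix matter: changing a single entry of an item's row changes which mastery profiles are expected to succeed on that item.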
