Automated Capability Evaluation of Foundation Models

Arash Afkanpour, Omkar Dige, Fatemeh Tavakoli, Negin Baghbanzadeh, Farnaz Kohankhaki, Elham Dolatabadi

TL;DR

ACE addresses the limitations of static benchmarks with an adaptive framework: frontier models construct a structured capability hierarchy and generate diverse evaluation tasks, and active learning in a latent semantic space then efficiently estimates a subject model's capability function f_Ω. Demonstrated in Mathematics, ACE builds 433 capabilities and 11,800 tasks, comes within 0.01 RMSE of exhaustive evaluation while evaluating fewer than half of the capabilities, and provides finer-grained insights than aggregate metrics. By embedding capabilities into a latent space and applying Gaussian-process-based active learning, ACE achieves more balanced domain coverage than static datasets and reveals skill differences across models that those datasets miss. This scalable, cost-efficient approach supports safer deployment of foundation models by enabling detailed, robust capability profiling.

Abstract

Current evaluation frameworks for foundation models rely heavily on static, manually curated benchmarks, limiting their ability to capture the full breadth of model capabilities. This paper introduces Active learning for Capability Evaluation (ACE), a novel framework for scalable, automated, and fine-grained evaluation of foundation models. ACE leverages the knowledge embedded in powerful frontier models to decompose a domain into semantically meaningful capabilities and generate diverse evaluation tasks, significantly reducing human effort. In Mathematics, ACE generated 433 capabilities and 11,800 tasks, covering 94% of Wikipedia-defined skills in the domain while introducing novel, coherent ones. To maximize efficiency, ACE fits a capability model in latent semantic space, allowing reliable approximation of a subject model's performance by evaluating only a subset of capabilities via active learning. It reaches within 0.01 RMSE of exhaustive evaluation while evaluating fewer than half of the capabilities. Compared to static datasets, ACE provides more balanced coverage and uncovers fine-grained differences that aggregate metrics fail to capture. Our results demonstrate that ACE provides a more complete and informative picture of model capabilities, which is essential for safe and well-informed deployment of foundation models.
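
To make the active-learning idea concrete, here is a minimal sketch of Gaussian-process regression over capability embeddings, where the next capability to evaluate is the one with the largest posterior uncertainty. It is an illustration under stated assumptions, not the authors' implementation: the RBF kernel, the maximum-standard-deviation acquisition rule, the synthetic 2-D latent points, the synthetic score function, and all function names (`rbf_kernel`, `gp_posterior`, `active_learning`) are hypothetical choices made for this example.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential kernel between two sets of latent points (assumed kernel).
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / length_scale**2)

def gp_posterior(z_train, y_train, z_query, noise=1e-3, length_scale=1.0):
    # Standard GP regression with a constant prior mean and unit prior variance.
    k_tt = rbf_kernel(z_train, z_train, length_scale) + noise * np.eye(len(z_train))
    k_qt = rbf_kernel(z_query, z_train, length_scale)
    prior_mean = y_train.mean()
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train - prior_mean))
    mean = prior_mean + k_qt @ alpha
    v = np.linalg.solve(chol, k_qt.T)
    var = 1.0 - (v**2).sum(0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def active_learning(z, evaluate, n_init=5, budget=40, seed=0):
    # z: (n_capabilities, d) latent embeddings; evaluate(i) returns the
    # subject model's score on capability i (the expensive step).
    rng = np.random.default_rng(seed)
    labeled = [int(i) for i in rng.choice(len(z), size=n_init, replace=False)]
    scores = {i: evaluate(i) for i in labeled}
    while len(labeled) < budget:
        pool = np.array([i for i in range(len(z)) if i not in scores])
        y = np.array([scores[i] for i in labeled])
        _, std = gp_posterior(z[labeled], y, z[pool])
        pick = int(pool[np.argmax(std)])  # most uncertain capability
        scores[pick] = evaluate(pick)
        labeled.append(pick)
    y = np.array([scores[i] for i in labeled])
    mean, _ = gp_posterior(z[labeled], y, z)
    return mean, labeled

# Synthetic stand-in for a capability space: 100 points in a 2-D latent space
# with a smooth "true" score function. None of this is data from the paper.
rng = np.random.default_rng(1)
latent = rng.uniform(-2.0, 2.0, size=(100, 2))
true_scores = 0.5 + 0.4 * np.sin(latent[:, 0]) * np.cos(latent[:, 1])

estimate, evaluated = active_learning(latent, lambda i: true_scores[i], budget=40)
rmse = float(np.sqrt(np.mean((estimate - true_scores) ** 2)))
print(f"evaluated {len(evaluated)} of {len(latent)} capabilities, RMSE = {rmse:.4f}")
```

The RMSE here compares the approximated capability function against scores on all capabilities, which mirrors how the abstract compares ACE with exhaustive evaluation. The numbers this toy run prints say nothing about the paper's reported results.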

Paper Structure

This paper contains 27 sections, 8 equations, 7 figures, 1 algorithm.

Figures (7)

  • Figure 1: An overview of ACE. Left: Example capability hierarchy in Mathematics. Right: The ACE pipeline combining automated capability generation, task generation and verification, and active learning in latent space for efficient model evaluation.
  • Figure 2: Coverage and validity of ACE-generated benchmarks. (a) Task distributions across mathematical areas for ACE (orange), GSM8K dataset (blue) and MATH dataset (green). (b) Subject model performance on MATH vs. ACE (synthetic) tasks. Stars indicate average score across all capabilities.
  • Figure 3: (a) Area-level benchmarking: subject model scores across mathematical areas. The reported score for each area is the average score of all capabilities within that area. (b) Semantic structure in latent space: Effect of input text and dimensionality reduction technique on capability function approximation.
  • Figure 4: Performance of approximating the capability function. (Left) RMSE, (Right) Uncertainty (average standard deviation) over iterations of active learning. Shaded areas indicate 95% confidence intervals.
  • Figure 5: Two-dimensional representation of Mathematics capabilities using t-SNE (left) and PCA (right). Each point corresponds to a capability, and colors indicate high-level areas. Stars indicate the mean of capability representations for each area.
  • ...and 2 more figures
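
The captions for Figures 2 and 3 describe two aggregates: an area score, which averages the scores of all capabilities within an area, and an overall average across all capabilities. The short sketch below shows both computations. The capability names, area labels, and score values are invented for illustration, and the function names `area_scores` and `overall_score` are hypothetical.

```python
from collections import defaultdict

def area_scores(capability_scores, capability_area):
    # Average the per-capability scores within each area.
    grouped = defaultdict(list)
    for capability, score in capability_scores.items():
        grouped[capability_area[capability]].append(score)
    return {area: sum(vals) / len(vals) for area, vals in grouped.items()}

def overall_score(capability_scores):
    # Average across all capabilities, regardless of area.
    return sum(capability_scores.values()) / len(capability_scores)

# Illustrative values only; these are not results from the paper.
scores = {"linear_equations": 0.5, "polynomial_factoring": 1.0, "triangle_area": 0.25}
areas = {
    "linear_equations": "Algebra",
    "polynomial_factoring": "Algebra",
    "triangle_area": "Geometry",
}

per_area = area_scores(scores, areas)  # {"Algebra": 0.75, "Geometry": 0.25}
overall = overall_score(scores)
print(per_area, round(overall, 3))
```

Because areas can contain different numbers of capabilities, the overall average is generally not the mean of the area scores, which is one reason a single aggregate can hide area-level differences.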