Law of the Weakest Link: Cross Capabilities of Large Language Models

Ming Zhong, Aston Zhang, Xuewei Wang, Rui Hou, Wenhan Xiong, Chenguang Zhu, Zhengxing Chen, Liang Tan, Chloe Bi, Mike Lewis, Sravya Popuri, Sharan Narang, Melanie Kambadur, Dhruv Mahajan, Sergey Edunov, Jiawei Han, Laurens van der Maaten

TL;DR

The paper addresses the gap between evaluating large language models on isolated capabilities and the multi-skill demands of real-world tasks. It defines seven core individual capabilities and seven cross capabilities, then builds CrossEval, a benchmark of 1,400 prompts with 4,200 reference responses and 8,400 expert ratings, plus LLM-based evaluators that align well with human judgments. Across 17 models, results reveal a pervasive "Law of the Weakest Link": cross-capability performance is largely constrained by the weakest involved ability. Of 58 cross-capability scores, 38 are lower than all individual capabilities, and the remaining 20 fall between the strong and weak abilities but closer to the weaker one. The authors also show that selectively enhancing weaker capabilities yields greater cross-capability gains than boosting stronger ones, and they present a principle-based prompting approach to probe and improve weak links. Overall, the work argues for prioritizing the identification and improvement of the weakest capabilities in complex, multi-dimensional tasks and provides CrossEval as a benchmark for future research.

Abstract

The development and evaluation of Large Language Models (LLMs) have largely focused on individual capabilities. However, this overlooks the intersection of multiple abilities across different types of expertise that are often required for real-world tasks, which we term cross capabilities. To systematically explore this concept, we first define seven core individual capabilities and then pair them to form seven common cross capabilities, each supported by a manually constructed taxonomy. Building on these definitions, we introduce CrossEval, a benchmark comprising 1,400 human-annotated prompts, with 100 prompts for each individual and cross capability. To ensure reliable evaluation, we involve expert annotators to assess 4,200 model responses, gathering 8,400 human ratings with detailed explanations to serve as reference examples. Our findings reveal that, in both static evaluations and attempts to enhance specific abilities, current LLMs consistently exhibit the "Law of the Weakest Link," where cross-capability performance is significantly constrained by the weakest component. Specifically, across 58 cross-capability scores from 17 models, 38 scores are lower than all individual capabilities, while 20 fall between strong and weak, but closer to the weaker ability. These results highlight the under-performance of LLMs in cross-capability tasks, making the identification and improvement of the weakest capabilities a critical priority for future research to optimize performance in complex, multi-dimensional scenarios.
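The counting in the abstract can be made concrete with a small sketch. The function below is not the authors' code; it is one plausible reading of how a cross-capability score might be placed relative to the two individual capabilities it combines (below both, between but closer to the weaker, and so on). The name `classify_cross_score`, the tie-breaking rule, and every score in `made_up_scores` are assumptions for illustration only. The size arithmetic at the end uses only the counts stated in the abstract.

```python
from collections import Counter


def classify_cross_score(cap_a: float, cap_b: float, cross: float) -> str:
    """Place a cross-capability score relative to its two individual scores."""
    weak, strong = min(cap_a, cap_b), max(cap_a, cap_b)
    if cross < weak:
        return "below_both"
    if cross > strong:
        return "above_both"
    # Between the two: compare distances to each end (ties count as weaker).
    if cross - weak <= strong - cross:
        return "between_closer_to_weak"
    return "between_closer_to_strong"


# Hypothetical (capability A, capability B, cross) scores, NOT paper results.
made_up_scores = [
    (70.0, 60.0, 55.0),
    (70.0, 60.0, 62.0),
    (70.0, 60.0, 68.0),
    (80.0, 65.0, 64.0),
]
tally = Counter(classify_cross_score(a, b, c) for a, b, c in made_up_scores)
print(dict(tally))

# Benchmark size arithmetic, using only counts stated in the abstract.
num_capabilities = 7 + 7                    # individual + cross capabilities
prompts = num_capabilities * 100            # 100 prompts per capability
responses_per_prompt = 4200 // prompts      # implied by 4,200 responses
ratings_per_response = 8400 // 4200         # implied by 8,400 ratings
print(prompts, responses_per_prompt, ratings_per_response)
```

Under this reading, the abstract's headline result corresponds to 38 scores in the `below_both` bucket and 20 in `between_closer_to_weak`, out of 58.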

Paper Structure (51 sections, 5 figures, 44 tables)

Figures (5)

  • Figure 1: Taxonomy visualizations for Image Recognition, Reasoning, and the corresponding cross capability. Each node represents a specific type of task. The first two taxonomies illustrate tasks that require only individual capabilities for LLMs to complete. The final taxonomy, however, depicts tasks that lie at the intersection of Image Recognition and Reasoning capabilities, necessitating the use of both abilities to accomplish them. For the full taxonomy of all the individual capabilities and cross capabilities, please see Appendix \ref{section:complete_taxonomy}.
  • Figure 2: Ablation study on the number of reference examples.
  • Figure 3: Density distribution of cross-capability performance compared to the two individual capabilities. The plot illustrates a pronounced "Law of the Weakest Link" effect in LLMs, where performance in cross-capability tasks tends to cluster around the weaker individual capability. This pattern is consistently observed regardless of the evaluator used.
  • Figure 4: Effect of $\Delta$ on the density distribution of cross-capability performance evaluated by GPT-4o.
  • Figure 5: Effect of $\Delta$ on the density distribution of cross-capability performance evaluated by Claude 3.5 Sonnet.