FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition

Xiaoqiang Wang; Lingfei Wu; Tengfei Ma; Bang Liu

TL;DR

FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation, formulates LLM evaluation in a multi-dimensional and explainable manner by dissociating language-related capabilities from cognition-related ones.

Abstract

Large language models (LLMs) are primarily evaluated by their overall performance on various text understanding and generation tasks. However, this paradigm fails to comprehensively differentiate fine-grained language and cognitive skills, and so offers insufficient interpretation of LLMs' capabilities. In this paper, we present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation. Specifically, we formulate LLM evaluation in a multi-dimensional and explainable manner by dissociating language-related capabilities from cognition-related ones. In addition, by extracting intermediate reasoning from LLMs, we break down the process of applying a specific capability into three sub-steps: recalling relevant knowledge, utilizing knowledge, and solving problems. Finally, FAC$^2$E evaluates each sub-step of each fine-grained capability, providing a two-faceted diagnosis for LLMs. Using FAC$^2$E, we identify a common shortfall in knowledge utilization among models and propose a straightforward, knowledge-enhanced method to mitigate it. Our results not only show promising performance improvements but also highlight a direction for future LLM advancements.

Paper Structure (11 sections, 1 equation, 9 figures, 4 tables)

Introduction
Methodology
Formulation of LLMs' Capabilities
FAC$^2$E
Experiments
Main results
Boosting LLMs with Injected Knowledge
Related Works
Conclusion
Implementation Details
Instruction Design

Figures (9)

Figure 1: Illustration of the FAC$^2$E pipeline. The input question is decomposed into two intermediate follow-up questions, which help the model talk with itself to elicit reasoning sub-steps. FAC$^2$E evaluates each sub-step to reveal crystallized performance, fluid performance, and the corresponding problem-solving performance. The content in parentheses is purely illustrative and is not part of the model input. The instruction is omitted here for clarity; please refer to the Instruction Design appendix for a full example.
Figure 2: Pairwise correlation of problem-solving performance ($s_3$) among different capabilities. Please refer to the capability schema table for full label names.
Figure 3: Bar diagram illustrating the relationship between problem-solving performance ($s_3$) and intermediate performance ($(s_1 + s_2) / 2$). Each bar of intermediate performance is divided into two stacked segments: the lower one denotes $s_1$ and the upper one denotes $s_2$.
Figure 4: Problem-solving performance of instruction-tuned LLaMA with different model sizes.
Figure 5: Problem-solving performance of LLaMA on different instruction-tuning datasets.
...and 4 more figures
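
The captions above refer to three per-sub-step scores, $s_1$, $s_2$, and $s_3$, and to an intermediate performance $(s_1 + s_2) / 2$. The following is a minimal Python sketch of how such scores could be held and compared. It is not the authors' implementation: the class and function names are hypothetical, and the mapping of $s_1$ to knowledge recall and $s_2$ to knowledge utilization is an assumption inferred from the order of the sub-steps in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubStepScores:
    """Hypothetical container for the per-sub-step scores of one capability."""

    s1: float  # assumed: recalling relevant knowledge
    s2: float  # assumed: utilizing knowledge
    s3: float  # problem-solving performance (named in the captions)

    def intermediate(self) -> float:
        """Intermediate performance as defined in the Figure 3 caption."""
        return (self.s1 + self.s2) / 2


def weaker_intermediate_step(scores: SubStepScores) -> str:
    """Return which intermediate sub-step scores lower (ties go to 'recall')."""
    return "utilization" if scores.s2 < scores.s1 else "recall"


if __name__ == "__main__":
    # Illustrative numbers only; they are not results from the paper.
    example = SubStepScores(s1=0.5, s2=0.25, s3=0.5)
    print(example.intermediate())
    print(weaker_intermediate_step(example))
```

Keeping the three scores separate, rather than collapsing them into one number, is what allows a diagnosis such as the knowledge-utilization shortfall described in the abstract.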
