DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks

Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, Xing Xie

TL;DR

DyVal introduces a dynamic evaluation protocol to robustly test large language models on reasoning tasks while mitigating data contamination and the stagnation of static benchmarks. It formalizes a general description language and a graph-informed sampling scheme using DAGs to generate samples with tunable complexity across mathematics, logic, and algorithms. Experimental results show that DyVal reveals performance gaps not apparent on static benchmarks and that DyVal-generated data can be leveraged to further fine-tune models on existing benchmarks. The framework is extensible to natural language tasks and supports co-evolution with current evaluation paradigms to drive more robust model assessment and improvement.

Abstract

Large language models (LLMs) have achieved remarkable performance on various evaluation benchmarks. However, concerns have been raised about potential data contamination in their large training corpora. Moreover, the static nature and fixed complexity of current benchmarks may inadequately gauge the advancing capabilities of LLMs. In this paper, we introduce DyVal, a general and flexible protocol for dynamic evaluation of LLMs. Based on our framework, we build graph-informed DyVal by leveraging the structural advantage of directed acyclic graphs to dynamically generate evaluation samples with controllable complexities. DyVal generates challenging evaluation sets on reasoning tasks including mathematics, logical reasoning, and algorithm problems. We evaluate various LLMs ranging from Flan-T5-large to GPT-3.5-Turbo and GPT-4. Experiments show that LLMs perform worse on DyVal-generated evaluation samples with different complexities, highlighting the significance of dynamic evaluation. We also analyze the failure cases and results of different prompting methods. Moreover, DyVal-generated samples are not only evaluation sets, but also helpful data for fine-tuning to improve the performance of LLMs on existing benchmarks. We hope that DyVal can shed light on future evaluation research of LLMs. Code is available at: https://github.com/microsoft/promptbench.
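To make the idea of DAG-based sample generation concrete, the following is a minimal sketch, not the authors' implementation and not the promptbench API. It assumes a binary tree whose leaves hold random integers and whose internal nodes hold a random arithmetic operation; the names `Node`, `build_tree`, `evaluate`, and `describe` are hypothetical. Increasing `depth` yields a harder sample, and the ground-truth answer comes from evaluating the tree.

```python
from __future__ import annotations

import itertools
import random
from dataclasses import dataclass, field
from typing import Iterator, Optional

# Hypothetical operation set for internal (non-leaf) nodes.
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}


@dataclass
class Node:
    name: str
    op: Optional[str] = None      # set on internal nodes
    value: Optional[int] = None   # set on leaf nodes
    children: list[Node] = field(default_factory=list)


def build_tree(depth: int, rng: random.Random,
               names: Optional[Iterator[str]] = None) -> Node:
    """Build a random binary tree with `depth` levels (depth 1 is a single leaf)."""
    if names is None:
        names = (f"v{i}" for i in itertools.count())
    name = next(names)
    if depth == 1:
        return Node(name, value=rng.randint(1, 10))
    children = [build_tree(depth - 1, rng, names) for _ in range(2)]
    return Node(name, op=rng.choice(sorted(OPS)), children=children)


def evaluate(node: Node) -> int:
    """Compute the ground-truth value of a node by recursing over its children."""
    if node.op is None:
        return node.value
    left, right = (evaluate(child) for child in node.children)
    return OPS[node.op](left, right)


def describe(node: Node, lines: Optional[list[str]] = None) -> list[str]:
    """Emit one sentence per node, children before parents."""
    lines = [] if lines is None else lines
    for child in node.children:
        describe(child, lines)
    if node.op is None:
        lines.append(f"The value of {node.name} is {node.value}.")
    else:
        left, right = node.children
        lines.append(f"{node.name} = {left.name} {node.op} {right.name}.")
    return lines


if __name__ == "__main__":
    root = build_tree(depth=3, rng=random.Random(0))
    print("\n".join(describe(root)))
    print(f"What is the value of {root.name}?  (answer: {evaluate(root)})")
```

Because each sample is drawn fresh from a seeded generator, the question text and its answer are produced together, which is the property that lets a dynamic benchmark sidestep memorized test items.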

Paper Structure (62 sections, 4 theorems, 10 figures, 10 tables)

Key Result

Theorem 3.1

Given a tree-based DAG with depth $d$ and width $w$, if the operation set for non-leaf nodes has $k$ distinct operations and the value set for leaf nodes contains $n$ distinct values, the probability that two independently generated DAGs are identical is: $P = \left(k^{\frac{w^{d-1}-1}{w-1}} \times n^{w^{d-1}}\right)^{-1}$.
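The counting argument behind this kind of result can be checked numerically. The sketch below assumes that a tree-based DAG of depth $d$ and width $w$ is a complete tree with $d$ levels in which every non-leaf node has $w$ children, and that operations and leaf values are drawn uniformly and independently. Under those assumptions there are $\frac{w^{d-1}-1}{w-1}$ non-leaf nodes and $w^{d-1}$ leaves, so $k^{\frac{w^{d-1}-1}{w-1}} \times n^{w^{d-1}}$ configurations are equally likely. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def tree_node_counts(depth: int, width: int) -> tuple[int, int]:
    """Return (non_leaf_count, leaf_count) for a complete tree with `depth` levels."""
    leaves = width ** (depth - 1)
    # Levels 0 .. depth-2 hold non-leaf nodes: 1 + w + ... + w^(d-2).
    non_leaves = sum(width ** level for level in range(depth - 1))
    return non_leaves, leaves


def identical_probability(depth: int, width: int, k: int, n: int) -> Fraction:
    """Closed form: one over the number of equally likely configurations."""
    non_leaves, leaves = tree_node_counts(depth, width)
    return Fraction(1, k ** non_leaves * n ** leaves)


def brute_force_probability(depth: int, width: int, k: int, n: int) -> Fraction:
    """Enumerate all ordered pairs of configurations and count the identical ones."""
    non_leaves, leaves = tree_node_counts(depth, width)
    configs = list(product(product(range(k), repeat=non_leaves),
                           product(range(n), repeat=leaves)))
    matches = sum(1 for a in configs for b in configs if a == b)
    return Fraction(matches, len(configs) ** 2)


if __name__ == "__main__":
    print(identical_probability(3, 2, 2, 2))  # 1/128
    assert identical_probability(3, 2, 2, 2) == brute_force_probability(3, 2, 2, 2)
```

Even for this tiny setting (depth 3, width 2, two operations, two leaf values) the collision probability is already below 1%, and it shrinks exponentially in the number of leaves as depth or width grows.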

Figures (10)

  • Figure 1: The pipeline of the graph-informed DyVal. Top: the general evaluation framework; bottom: an arithmetic example. More details can be found in Sec. \ref{sec-method-graph} and Appendix \ref{sec-append-detail}.
  • Figure 2: Results on 7 tasks with complexity from D1 to D4 (averaged over 3 description orders and 3 seeds). Xwin-13B, phi-1.5, and WizardMath-13B are not shown as their results are all 0.
  • Figure 3: Human vs. LLM results.
  • Figure 4: Failure modes distribution.
  • Figure 5: Comparison results across different complexity constraints.
  • ...and 5 more figures

Theorems & Definitions (6)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem C.1
  • proof
  • Theorem C.2
  • proof