Vygotsky Distance: Measure for Benchmark Task Similarity

Maxim K. Surkov, Ivan P. Yamshchikov

TL;DR

The paper tackles inefficiencies in NLP benchmarking by introducing Vygotsky distance, a learner-centric similarity metric defined via relative model performance across tasks and formalized with a permutation-inversion distance $w(\pi,\sigma)=\operatorname{inv}(\pi\circ\sigma^{-1})$. It represents benchmarks as weighted graphs and analyzes their structure through minimum weight spanning trees, enabling principled benchmark compression. A practical algorithm splits benchmarks into public/private leaderboards and uses SVM, GP, and MLP to predict task-level model rankings or scores, achieving about $80\%$ classification accuracy and RMSE around $0.2$ on best subsets at a compression rate near $40\%$. Experiments on GLUE, SuperGLUE, CLUE, and RussianSuperGLUE demonstrate that many benchmarks can be reduced substantially without sacrificing the ability to assess generalization, offering a scalable, principled approach to benchmark design and validation in NLP.
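To make the permutation-inversion distance concrete, here is a minimal Python sketch. It assumes that each task induces a ranking of the same set of models, and that $\pi$ and $\sigma$ are stored as rank vectors (entry `m` is the position of model `m` on that task, 0 being best). Under that assumption, the number of inversions of $\pi\circ\sigma^{-1}$ equals the number of model pairs the two tasks order differently (the Kendall tau distance). The function names `ranks` and `inversion_distance`, the tie-breaking rule, and the toy scores are illustrative choices, not taken from the paper.

```python
def ranks(scores):
    """Map per-model scores on one task to a rank vector.

    ranks(scores)[m] is the position of model m (0 = best score).
    Ties are broken by model index; this is an assumption of the sketch.
    """
    order = sorted(range(len(scores)), key=lambda m: (-scores[m], m))
    rank_of = [0] * len(scores)
    for position, model in enumerate(order):
        rank_of[model] = position
    return rank_of


def inversion_distance(pi, sigma):
    """Count inversions of the composition pi o sigma^{-1}.

    pi and sigma are rank vectors over the same models. The result is the
    number of model pairs that the two rankings order differently.
    """
    n = len(pi)
    # sigma_inv[r] is the model that holds rank r under sigma.
    sigma_inv = [0] * n
    for model, rank in enumerate(sigma):
        sigma_inv[rank] = model
    # composed[r] is the rank under pi of the model ranked r under sigma.
    composed = [pi[sigma_inv[r]] for r in range(n)]
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if composed[i] > composed[j]
    )


# Toy scores for four hypothetical models on three hypothetical tasks.
task_a = ranks([0.9, 0.8, 0.7, 0.6])
task_b = ranks([0.8, 0.9, 0.7, 0.6])  # top two models swapped
task_c = ranks([0.1, 0.2, 0.3, 0.4])  # fully reversed order

print(inversion_distance(task_a, task_b))
print(inversion_distance(task_a, task_c))
```

The quadratic pair count keeps the sketch short; for long leaderboards a merge-sort-based inversion count would scale better. Whether and how the paper normalizes this count is not stated in this summary, so the sketch returns the raw number of inversions.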

Abstract

Evaluation plays a significant role in modern natural language processing. Most modern NLP benchmarks consist of arbitrary sets of tasks that neither guarantee any generalization potential for the model once applied outside the test set nor try to minimize the resource consumption needed for model evaluation. This paper presents a theoretical instrument and a practical algorithm to calculate similarity between benchmark tasks; we call this similarity measure "Vygotsky distance". The core idea of this similarity measure is that it is based on the relative performance of the "students" on a given task, rather than on the properties of the task itself. If two tasks are close to each other in terms of Vygotsky distance, the models tend to have similar relative performance on them. Thus, knowing the Vygotsky distance between tasks, one can significantly reduce the number of evaluation tasks while maintaining a high validation quality. Experiments on various benchmarks, including GLUE, SuperGLUE, CLUE, and RussianSuperGLUE, demonstrate that a vast majority of NLP benchmarks could be at least 40% smaller in terms of the tasks included. Most importantly, Vygotsky distance could also be used for the validation of new tasks, thus increasing the generalization potential of future NLP models.
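The summary above describes viewing a benchmark as a weighted graph of tasks and inspecting its minimum weight spanning tree. The following Python sketch shows one way such a tree could be computed from a matrix of pairwise task distances, using Prim's algorithm. The task names are SuperGLUE tasks, but the distance values are made-up toy numbers and the function name `minimum_spanning_tree` is illustrative; none of this reproduces the paper's data or code.

```python
def minimum_spanning_tree(weights):
    """Prim's algorithm on a dense, symmetric matrix of pairwise distances.

    Returns a list of (u, v, weight) edges that connects all nodes with
    minimum total weight. Node 0 is used as the starting point.
    """
    n = len(weights)
    if n == 0:
        return []
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Pick the cheapest edge leaving the current tree; ties are broken
        # by the (u, v) index pair so the result is deterministic.
        u, v = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda edge: (weights[edge[0]][edge[1]], edge),
        )
        edges.append((u, v, weights[u][v]))
        in_tree.add(v)
    return edges


# Toy example: four tasks with made-up pairwise distances (not paper results).
tasks = ["BoolQ", "COPA", "CB", "MultiRC"]
toy_distances = [
    [0, 2, 5, 1],
    [2, 0, 3, 4],
    [5, 3, 0, 6],
    [1, 4, 6, 0],
]

tree = minimum_spanning_tree(toy_distances)
for u, v, w in tree:
    print(f"{tasks[u]} -- {tasks[v]} (distance {w})")
```

In a tree like this, short edges link tasks on which models are ranked similarly, which is the intuition behind treating tightly linked tasks as candidates for removal. The actual selection procedure used in the paper is not described in this summary.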

Paper Structure

This paper contains 15 sections, 14 equations, 11 figures, and 2 tables.

Figures (11)

  • Figure 1: Minimum Weight Spanning Tree of the GLUE benchmark.
  • Figure 2: Minimum Weight Spanning Tree of the SuperGLUE benchmark.
  • Figure 3: Distribution of benchmark sizes among the entire "Papers With Code" database.
  • Figure 4: Machine learning areas coverage distribution where the contribution of each field is calculated as the total number of tasks.
  • Figure 5: Example of splitting SuperGLUE into public and private leaderboards. Cyan-colored columns (BoolQ, COPA) belong to the public subset, magenta-colored columns (CB, MultiRC) belong to the private subset.
  • ...and 6 more figures