Causality can systematically address the monsters under the bench(marks)

Felix Leeb, Zhijing Jin, Bernhard Schölkopf

TL;DR

This paper tackles unreliable evaluation of large models caused by "monsters" such as biases, leakage, and misleading benchmarks. It advocates a causality-driven experimental design using Common Abstract Topologies (CATs) to make underlying assumptions explicit and to guide analysis. Through case studies in confounding, mediation, and spurious correlations, it demonstrates how explicit causal framing clarifies hypotheses, enables principled tests (e.g., $NDE$, $NIE$, $CDE$, $ACE$), and informs new evaluation approaches. The approach promises more robust, transparent, and trustworthy progress in empirical machine learning by unifying disparate issues under a practical, hypothesis-driven framework.
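
For readers unfamiliar with the abbreviations, the block below sketches the textbook definitions of the average causal effect ($ACE$), controlled direct effect ($CDE$), natural direct effect ($NDE$), and natural indirect effect ($NIE$) for a binary treatment. The symbols $X$ (treatment), $M$ (mediator), $Y$ (outcome), and the potential-outcome notation $Y_{x, M_{x'}}$ are assumptions chosen here for illustration; this summary does not state which notation the paper itself uses.

```latex
\begin{align}
  ACE    &= \mathbb{E}[Y \mid do(X=1)] - \mathbb{E}[Y \mid do(X=0)] \\
  CDE(m) &= \mathbb{E}[Y \mid do(X=1, M=m)] - \mathbb{E}[Y \mid do(X=0, M=m)] \\
  NDE    &= \mathbb{E}[Y_{1, M_0}] - \mathbb{E}[Y_{0, M_0}] \\
  NIE    &= \mathbb{E}[Y_{0, M_1}] - \mathbb{E}[Y_{0, M_0}]
\end{align}
```

Here $Y_{x, M_{x'}}$ denotes the outcome under treatment $x$ with the mediator held at the value it would take under treatment $x'$. In linear models without interaction, $ACE = NDE + NIE$; in general the decomposition is less direct.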

Abstract

Effective and reliable evaluation is essential for advancing empirical machine learning. However, the increasing accessibility of generalist models and the progress towards ever more complex, high-level tasks make systematic evaluation more challenging. Benchmarks are plagued by various biases, artifacts, or leakage, while models may behave unreliably due to poorly explored failure modes. Haphazard treatments and inconsistent formulations of such "monsters" can contribute to a duplication of efforts, a lack of trust in results, and unsupported inferences. In this position paper, we argue causality offers an ideal framework to systematically address these challenges. By making causal assumptions in an approach explicit, we can faithfully model phenomena, formulate testable hypotheses with explanatory power, and leverage principled tools for analysis. To make causal model design more accessible, we identify several useful Common Abstract Topologies (CATs) in causal graphs which help gain insight into the reasoning abilities in large language models. Through a series of case studies, we demonstrate how the precise yet pragmatic language of causality clarifies the strengths and limitations of a method and inspires new approaches for systematic progress.

Paper Structure

This paper contains 18 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: Growth of reasoning papers in the ACL Anthology. Papers mentioning "causality" are not growing at the same rate, suggesting that NLP is underutilizing causality.
  • Figure 2: This dendrogram shows the co-occurrences of causal and causality-adjacent terms in papers that contain "reasoning" in the abstract (3181 papers in total) from the ACL Anthology over the past 10 years. The numbers in parentheses indicate the number of papers that mention each term. Note that the very first split separates all the causality-related terms from the rest of the terms, suggesting relatively poor co-occurrence with other invariably related concepts.
  • Figure 3: Sketch of a conceptual causal model treating the prompt (i.e. surface form) as a mediator between the underlying problem or task of interest and the model's response.
  • Figure 4: Various causal framings based on the approach of Razeghi et al. (2022). (a) A simple interpretation of their approach using the confounding CAT. (b) A causal framing for an alternative approach where we do not consider the term frequency, and instead observe a spurious correlation. (c) A more cautious causal framing that combines CATs to avoid claiming that the term frequency causally affects the model's accuracy (as is consistent with the authors' approach).
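
The confounding CAT mentioned in Figure 4(a) can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the authors' code or the setup of the cited study: the variable names `x`, `y`, `z`, the linear-Gaussian model, and the coefficients 0.8 and 1.5 are all hypothetical choices made here. A confounder `z` drives both `x` and `y`, so a naive regression of `y` on `x` reports a sizeable association even though the true effect of `x` on `y` is zero, while adjusting for `z` recovers an estimate near zero.

```python
import numpy as np


def simulate(n=20000, true_effect=0.0, seed=0):
    """Draw synthetic data where z confounds x and y (hypothetical model)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                      # confounder
    x = 0.8 * z + rng.normal(size=n)            # treatment depends on z
    y = true_effect * x + 1.5 * z + rng.normal(size=n)  # outcome depends on x and z
    return x, y, z


def naive_slope(x, y):
    """Slope of y on x, ignoring the confounder."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


def adjusted_slope(x, y, z):
    """Coefficient of x in a least-squares fit of y on x, z, and an intercept."""
    design = np.column_stack([x, z, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0]


x, y, z = simulate()
print("naive slope:   ", round(naive_slope(x, y), 3))
print("adjusted slope:", round(adjusted_slope(x, y, z), 3))
```

Under this model the naive slope converges to 1.5 · 0.8 / (0.8² + 1) ≈ 0.73, which is entirely spurious. Regression adjustment only works here because the confounder is observed and the model is linear; making such assumptions explicit is the kind of step a causal framing asks for.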