Weisfeiler and Leman Go Measurement Modeling: Probing the Validity of the WL Test

Arjun Subramonian, Adina Williams, Maximilian Nickel, Yizhou Sun, Levent Sagun

TL;DR

This paper reveals systematic misalignments between graph ML practitioners' conceptions of expressive power and the $k$-WL test, arguing that $k$-WL does not guarantee isometry, may be task-irrelevant, and can conflict with generalization or trustworthiness. Using a measurement-modeling lens, a survey of $n=18$ practitioners, and benchmark auditing, the authors show that 1-WL often suffices on common benchmarks and that GIN representations can align more closely with task loss than with WL colorings. They also demonstrate social risks, such as privacy concerns, that arise from WL-based discriminability. In place of intensional, WL-centric definitions, the authors advocate extensional, task-driven benchmarks for measuring expressive power, which better capture real-world goals such as fairness, robustness, and privacy and thereby improve transparency and trust in graph ML research. They complement this framework with guiding questions for constructing benchmarks that reflect real-world graph tasks and societal considerations.

Abstract

The expressive power of graph neural networks is usually measured by comparing how many pairs of graphs or nodes an architecture can possibly distinguish as non-isomorphic to those distinguishable by the $k$-dimensional Weisfeiler-Leman ($k$-WL) test. In this paper, we uncover misalignments between graph machine learning practitioners' conceptualizations of expressive power and $k$-WL through a systematic analysis of the reliability and validity of $k$-WL. We conduct a survey ($n = 18$) of practitioners to surface their conceptualizations of expressive power and their assumptions about $k$-WL. In contrast to practitioners' beliefs, our analysis (which draws from graph theory and benchmark auditing) reveals that $k$-WL does not guarantee isometry, can be irrelevant to real-world graph tasks, and may not promote generalization or trustworthiness. We argue for extensional definitions and measurement of expressive power based on benchmarks. We further contribute guiding questions for constructing such benchmarks, which is critical for graph machine learning practitioners to develop and transparently communicate our understandings of expressive power.
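
As a point of reference for the 1-WL test discussed above, the following is a minimal sketch of color refinement, not the authors' implementation. It assumes undirected graphs given as adjacency dictionaries with identical initial node features; the helper names `refine` and `wl_distinguishes` are hypothetical. Two graphs are refined jointly (as a disjoint union) so that their colors are comparable, and the test reports them as distinguishable when their final color histograms differ.

```python
from collections import Counter


def refine(adj):
    """1-WL color refinement on a graph given as {node: [neighbors]}.

    Assumption: every node starts with the same color (no node features).
    Returns a dict mapping each node to its stable color (an int).
    """
    colors = {v: 0 for v in adj}
    for _ in range(len(adj)):
        # A node's signature is its own color plus the sorted multiset
        # of its neighbors' colors.
        sigs = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        # Compress signatures back to small integer colors.
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        new_colors = {v: palette[sigs[v]] for v in adj}
        # Refinement only ever splits color classes, so an unchanged
        # class count means the partition has converged.
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors
        colors = new_colors
    return colors


def wl_distinguishes(adj_a, adj_b):
    """Return True if 1-WL assigns the two graphs different color histograms."""
    # Refine both graphs jointly so that color ids mean the same thing in each.
    union = {("a", v): [("a", u) for u in nbrs] for v, nbrs in adj_a.items()}
    union.update(
        {("b", v): [("b", u) for u in nbrs] for v, nbrs in adj_b.items()}
    )
    colors = refine(union)
    hist_a = Counter(c for (g, _), c in colors.items() if g == "a")
    hist_b = Counter(c for (g, _), c in colors.items() if g == "b")
    return hist_a != hist_b


# Two disjoint triangles vs. a 6-cycle: non-isomorphic, but both 2-regular.
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

# A 3-node path vs. a triangle: degrees differ, so 1-WL separates them.
path = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

print(wl_distinguishes(two_triangles, hexagon))  # False
print(wl_distinguishes(path, triangle))          # True
```

The first pair is a standard example of non-isomorphic graphs that 1-WL cannot tell apart, which illustrates why passing or failing the test says nothing by itself about whether a distinction matters for a given task.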

Paper Structure

This paper contains 61 sections, 1 equation, 34 figures, and 5 tables.

Figures (34)

  • Figure 1: The top row depicts graphs prior to running 1-WL, and the bottom row depicts the graphs and their colors after running 1-WL until convergence. For simplicity, all nodes have the same initial features.
  • Figure 2: Distributions of WL kernel similarities and GIN encoder representation similarities of graph pairs with different vs. the same labels.
  • Figure 3: 1-WL non-distinguishable graph pairs from the MUTAG benchmark (after three iterations).
  • Figure 4: Adjusted mutual information (AMI) between different benchmark partitions.
  • Figure 5: Distributions of WL kernel similarities and GIN encoder representation similarities of graph pairs with different vs. the same labels.
  • ...and 29 more figures