The ProfessionAl Go annotation datasEt (PAGE)

Yifan Gao, Danni Zhang, Haoyue Li

TL;DR

PAGE delivers the first large-scale, extensively annotated dataset of professional Go games, combining 98,525 records with rich metadata and KataGo-derived in-game statistics to enable rigorous data-driven analysis. The authors demonstrate PAGE’s value through three downstream tasks: gender participation analysis, blunder prediction using CNN/Transformer architectures, and game outcome prediction with multiple ML models, achieving strong results (e.g., CatBoost 75.3% accuracy). They also discuss future directions in advanced statistics, behavior modeling, rating systems, and live commentary, highlighting PAGE's potential to catalyze research in game analytics and psychology. The work provides a practical, publicly available resource that bridges Go studies with broader data science and cognitive-science questions, supporting both methodological development and empirical studies of human decision-making in a high-skill domain.

Abstract

The game of Go has been highly under-researched due to the lack of game records and analysis tools. In recent years, the increasing number of professional competitions and the advent of AlphaZero-based algorithms have provided an excellent opportunity to analyze human Go games on a large scale. In this paper, we present the ProfessionAl Go annotation datasEt (PAGE), which contains 98,525 games played by 2,007 professional players and spans more than 70 years. The dataset includes rich AI analysis results for each move. Moreover, PAGE provides detailed metadata for every player and game after manual cleaning and labeling. Beyond a preliminary analysis of the dataset, we provide sample tasks that benefit from it to demonstrate the potential applications of PAGE in multiple research directions. To the best of our knowledge, PAGE is the first dataset with extensive annotation in the game of Go. This work extends [1] with a more detailed description, analysis, and set of applications.

Paper Structure (31 sections, 5 figures, 7 tables)

Figures (5)

  • Figure 1: In-game statistics and metadata for the game between Ke Jie and Lee Sedol in the final of the 2nd Mlily Cup. These statistics and metadata are the core features of PAGE.
  • Figure 2: Illustrations of dataset statistics. (a) distribution of player ages in different generations; (b) game counts by year; (c) distribution of game lengths; (d) mean move similarity by year; (e) mean win-rate loss for different ratings by year; (f) mean score loss for different ratings by year.
  • Figure 3: Observed and expected rankings of the WHR ratings for the top 100 female players. The red line is the observed rank, and the blue line is the expected rank. The dotted lines represent the quantiles $r_{low}$ and $r_{high}$.
  • Figure 4: Observed and expected sex differences in WHR ratings. The red line shows the actual score differences, and the blue line shows the score differences attributable to different participation rates.
  • Figure 5: The month-by-month trend of the explanation rate of the actual rating differences. The dotted lines represent the upper and lower bounds of the observed values.