Visualization of Unstructured Sports Data -- An Example of Cricket Short Text Commentary
Swarup Ranjan Behera, Vijaya V Saradhi
TL;DR
Sports visualization relies predominantly on structured data. This work addresses that gap by introducing cricket short text commentary as an unstructured data source for visualization. It builds a computational framework that uses a confrontation matrix and Correspondence Analysis (CA) to extract strength and weakness rules for individual players. The rules are visualized with biplots and complemented by t-SNE to reveal groups of similar players. The approach is validated through comparison with expert judgment and through Procrustes analysis, which together indicate reliable rule extraction and meaningful player groupings. The data and code are publicly available. The methodology offers analysts, coaches, and teams a ball-by-ball contextual perspective that can augment strategic decision-making in cricket.
Abstract
Sports visualization focuses on the use of structured data, such as box-score data and tracking data. Unstructured data sources pertaining to sports are available in various places such as blogs, social media posts, and online news articles. Existing sports visualization methods either have not fully exploited the information present in these sources, or the visualizations built from these sources have not added to the body of sports visualization methods. We propose the use of unstructured data, namely cricket short text commentary, for visualization. The short text commentary data is used to construct each individual player's strength rules and weakness rules. A computationally feasible definition of a player's strength rule and weakness rule is proposed. A visualization method for the constructed rules is presented. In addition, players having similar strength rules or weakness rules are identified and visualized. We demonstrate the usefulness of short text commentary in visualization by analyzing the strengths and weaknesses of cricket players using more than one million text commentaries. We validate the constructed rules through two validation methods. The collected data, source code, and obtained results on more than 500 players are made publicly available.
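
As a rough illustration of the Correspondence Analysis step named in the TL;DR, the following Python sketch runs a generic SVD-based CA on a small made-up contingency table. The row labels, column labels, counts, and the function name `correspondence_analysis` are all hypothetical. The paper's actual confrontation matrix and feature definitions are not given in this section, so this is a sketch of textbook CA under those assumptions, not the authors' implementation.

```python
import numpy as np


def correspondence_analysis(counts):
    """Generic correspondence analysis of a two-way table of counts.

    Returns principal coordinates for rows and columns, plus the
    principal inertias (squared singular values) of each dimension.
    Assumes every row and every column has a non-zero total.
    """
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                # correspondence matrix (relative frequencies)
    r = P.sum(axis=1)              # row masses
    c = P.sum(axis=0)              # column masses
    expected = np.outer(r, c)      # frequencies expected under independence

    # Standardized residuals: deviation from independence, scaled by masses.
    S = (P - expected) / np.sqrt(expected)

    U, sv, Vt = np.linalg.svd(S, full_matrices=False)

    # Principal coordinates: singular vectors scaled by singular values
    # and rescaled by the inverse square root of the masses.
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    return row_coords, col_coords, sv ** 2


# Hypothetical toy table (NOT data from the paper):
# rows are invented bowling features, columns are invented batting responses.
row_labels = ["short length", "good length", "full length"]
col_labels = ["attacked", "defended", "beaten"]
toy_counts = np.array([
    [30, 10, 5],
    [12, 25, 8],
    [6, 9, 20],
])

row_coords, col_coords, inertia = correspondence_analysis(toy_counts)

# The first two dimensions are what a biplot would typically display.
for label, xy in zip(row_labels, row_coords[:, :2]):
    print("row", label, xy)
for label, xy in zip(col_labels, col_coords[:, :2]):
    print("col", label, xy)
print("share of inertia per dimension:", inertia / inertia.sum())
```

In a biplot built from such output, a row point and a column point that lie in the same direction from the origin occur together more often than independence would predict. That is the kind of association a strength or weakness rule would be read from.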
