SportsMetrics: Blending Text and Numerical Data to Understand Information Fusion in LLMs

Yebowen Hu, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Hassan Foroosh, Dong Yu, Fei Liu

TL;DR

SportsMetrics presents a benchmark to evaluate LLMs on numerical reasoning and information fusion by processing long, text-rich play-by-play narratives from NBA and NFL games. It introduces four adversarial tasks (New Rule, Swap, Shuffle, and planning-based data queries) to probe LLMs' adaptability, robustness, and memory for complex data. The evaluation emphasizes a JSON-based working memory and uses domain-specific scoring metrics like NBA Game Score and NCAA Passing Efficiency to quantify performance. Findings show long-context LLMs generally outperform standard models, highlighting the importance of context length for accurate numerical tracking in long narratives. The benchmark offers a practical, sports-centric framework with potential extensions to multiplayer gaming and collaborative analytics.

Abstract

Large language models hold significant potential for integrating various data types, such as text documents and database records, for advanced analytics. However, blending text and numerical data presents substantial challenges. LLMs need to process and cross-reference entities and numbers, handle data inconsistencies and redundancies, and develop planning capabilities such as building a working memory for managing complex data queries. In this paper, we introduce four novel tasks centered around sports data analytics to evaluate the numerical reasoning and information fusion capabilities of LLMs. These tasks involve providing LLMs with detailed, play-by-play sports game descriptions, then challenging them with adversarial scenarios such as new game rules, longer durations, scrambled narratives, and analyzing key statistics in game summaries. We conduct extensive experiments on NBA and NFL games to assess the performance of LLMs on these tasks. Our benchmark, SportsMetrics, introduces a new mechanism for assessing LLMs' numerical reasoning and fusion skills.

Paper Structure

This paper contains 13 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Play-by-plays of an NBA game. We include timestamps, player actions, team affiliations and a game recap. Total points for both teams are indicated in dotted circles and are withheld from LLMs.
  • Figure 2: (Top Left) We examine the impact of changing game rules on final scores. In basketball, scoring events such as free throws, field goals, and three-pointers are worth between 1 and 3 points. We ask LLMs to track the same scoring events under a new rule where each is worth only 1 point. (Bottom Left) We randomly swap player team affiliations in the table without altering the game's play-by-play records. (Right) LLMs are provided with detailed play-by-play descriptions of a sports game and player team affiliations. Their job is to use this information to update key game statistics in a JSON format.
  • Figure 3: We adopt the NBA's Game Score, originally designed for player evaluation, to measure a team's overall efficiency. For American football, we apply NCAA's Passing Efficiency formula.
  • Figure 4: An LLM fills in missing key statistics in game summaries through a three-step process. Initially, the LLM creates an internal JSON object as its memory. It then enriches this memory by adding necessary game or player statistics, where all values are set to null, and further reflects on whether this memory is sufficient to accomplish the task. Lastly, the LLM uses detailed play-by-play and team-player data to update the JSON object's values; it finally utilizes this updated memory to fill in the blanks in the game summary.
  • Figure 5: Effective working memory is key in this task. The variance in memory structure arises because we allowed each LLM to generate its JSON object as working memory, without enforcing a uniform schema. This step allows us to explore how each model organizes its memory to complete the task. Note that Claude's 'null' values represent an initial state rather than an inability to aggregate information.
  • ...and 3 more figures
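
The New Rule and Swap scenarios described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' code: the player names, the play list, and the `tally` helper are hypothetical, and only the point values (1 for a free throw, 2 for a field goal, 3 for a three-pointer, versus 1 for every event under the new rule) follow the caption.

```python
import json

# Hypothetical scoring events as (player, event type); not taken from the paper's data.
PLAYS = [
    ("A. Guard", "three_pointer"),
    ("B. Center", "field_goal"),
    ("A. Guard", "free_throw"),
    ("C. Forward", "three_pointer"),
    ("C. Forward", "field_goal"),
]

# Player-to-team table, supplied separately from the play-by-play.
AFFILIATIONS = {"A. Guard": "Home", "B. Center": "Home", "C. Forward": "Away"}

STANDARD_RULE = {"free_throw": 1, "field_goal": 2, "three_pointer": 3}
# New Rule: every scoring event is worth exactly 1 point.
NEW_RULE = {event: 1 for event in STANDARD_RULE}


def tally(plays, affiliations, rule):
    """Return a JSON-serializable dict of team point totals under `rule`."""
    memory = {team: 0 for team in sorted(set(affiliations.values()))}
    for player, event in plays:
        memory[affiliations[player]] += rule[event]
    return memory


standard = tally(PLAYS, AFFILIATIONS, STANDARD_RULE)
new_rule = tally(PLAYS, AFFILIATIONS, NEW_RULE)

# Swap: move one player to the other team in the table only;
# the play-by-play records stay untouched.
swapped_affiliations = {**AFFILIATIONS, "B. Center": "Away"}
swapped = tally(PLAYS, swapped_affiliations, STANDARD_RULE)

print(json.dumps({"standard": standard, "new_rule": new_rule, "swap": swapped}, indent=2))
```

The same play list yields different totals in each scenario, so a model cannot answer correctly from prior knowledge of the game and has to attribute every event from the given text and table.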
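
For readers unfamiliar with the two metrics named in the Figure 3 caption, the sketch below computes them from the commonly published formulas: John Hollinger's Game Score and the NCAA passing efficiency rating. How the paper applies them at team level is not shown in this excerpt, so the dictionary keys, function names, and sample numbers here are assumptions for illustration only.

```python
def game_score(s):
    """Hollinger's Game Score from a dict of box-score totals.

    Assumed keys: pts, fg, fga, ft, fta, orb, drb, stl, ast, blk, pf, tov.
    """
    return (
        s["pts"]
        + 0.4 * s["fg"]
        - 0.7 * s["fga"]
        - 0.4 * (s["fta"] - s["ft"])
        + 0.7 * s["orb"]
        + 0.3 * s["drb"]
        + s["stl"]
        + 0.7 * s["ast"]
        + 0.7 * s["blk"]
        - 0.4 * s["pf"]
        - s["tov"]
    )


def passing_efficiency(yards, touchdowns, completions, interceptions, attempts):
    """NCAA passing efficiency rating; undefined when there are no attempts."""
    if attempts == 0:
        raise ValueError("passing efficiency is undefined for zero attempts")
    return (
        8.4 * yards + 330 * touchdowns + 100 * completions - 200 * interceptions
    ) / attempts


# Hypothetical team totals, invented for this example.
team_stats = {
    "pts": 100, "fg": 40, "fga": 80, "ft": 15, "fta": 20, "orb": 10,
    "drb": 30, "stl": 8, "ast": 25, "blk": 5, "pf": 20, "tov": 12,
}
team_game_score = game_score(team_stats)
team_passing = passing_efficiency(
    yards=250, touchdowns=2, completions=20, interceptions=1, attempts=30
)
print(team_game_score, team_passing)
```

Both metrics collapse many tracked statistics into one number, so an error in any single statistic an LLM aggregates shifts the final value.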
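
The three-step process in the Figure 4 caption (create a JSON memory with null values, check that it is sufficient, then update it and fill the blanks) can be sketched as plain Python. In the paper an LLM performs each step through prompting; here ordinary functions stand in for the model, and the summary template, play list, and all function names are hypothetical.

```python
import json

# Hypothetical game summary with blanks to fill.
SUMMARY_TEMPLATE = "{winner} beat {loser} {winner_pts}-{loser_pts}."

# Steps 1 and 2: create the memory and add the statistics the summary needs,
# with every value initialised to None (null in JSON).
memory = {"Home": {"points": None}, "Away": {"points": None}}


def is_sufficient(mem, required=("points",)):
    """Reflection step: does every team entry hold the required statistics?"""
    return all(stat in team for team in mem.values() for stat in required)


# Hypothetical scoring plays as (team, points).
plays = [("Home", 3), ("Away", 2), ("Home", 2), ("Away", 2), ("Home", 1)]


def update(mem, plays):
    """Step 3a: replace null values with totals aggregated from the plays."""
    filled = {team: dict(stats, points=0) for team, stats in mem.items()}
    for team, pts in plays:
        filled[team]["points"] += pts
    return filled


def fill_summary(mem):
    """Step 3b: fill the blanks from the updated memory (ties are not handled)."""
    winner, loser = sorted(mem, key=lambda t: mem[t]["points"], reverse=True)
    return SUMMARY_TEMPLATE.format(
        winner=winner,
        loser=loser,
        winner_pts=mem[winner]["points"],
        loser_pts=mem[loser]["points"],
    )


print(json.dumps(memory))  # initial state: all values are null
assert is_sufficient(memory)
updated = update(memory, plays)
summary = fill_summary(updated)
print(summary)
```

Keeping the schema step separate from the update step mirrors the caption's point that null values mark an initial state, not a failure to aggregate.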