When Reasoning Meets Information Aggregation: A Case Study with Sports Narratives

Yebowen Hu, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Wenlin Yao, Hassan Foroosh, Dong Yu, Fei Liu

TL;DR

This work examines how large language models aggregate information to support analytical reasoning over longitudinal data, using sports narratives as a controlled testbed. It introduces SportsGen, a method for synthesizing diverse, controllable game narratives, and evaluates reasoning with two divide-and-conquer strategies (Player-Centric and Batch-Centric). It also proposes a new metric, Discounted Cumulative Accuracy, which accounts for near-correct predictions. Key findings: even strong models such as GPT-4o can struggle with precise point aggregation, and performance is highly sensitive to narrative density, complexity, and domain terminology; symbolic-context tests indicate that models continue to rely on natural-language cues. The work provides a practical benchmark and methodological framework for assessing and improving LLM reasoning over structured, data-rich narratives, with relevance to real-world longitudinal decision-making tasks.

Abstract

Reasoning is most powerful when an LLM accurately aggregates relevant information. We examine the critical role of information aggregation in reasoning by requiring the LLM to analyze sports narratives. To succeed at this task, an LLM must infer points from actions, identify related entities, attribute points accurately to players and teams, and compile key statistics to draw conclusions. We conduct comprehensive experiments with real NBA basketball data and present SportsGen, a new method to synthesize game narratives. By synthesizing data, we can rigorously evaluate LLMs' reasoning capabilities under complex scenarios with varying narrative lengths and density of information. Our findings show that most models, including GPT-4o, often fail to accurately aggregate basketball scores due to frequent scoring patterns. Open-source models like Llama-3 further suffer from significant score hallucinations. Finally, the effectiveness of reasoning is influenced by narrative complexity, information density, and domain-specific terms, highlighting the challenges in analytical reasoning tasks.
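The aggregation task described in the abstract (infer points from actions, attribute them to players and teams, compile totals) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the play tuples, the placeholder team and player names, and the keyword-based `points_for` rule are hypothetical and are not the paper's data format or evaluation code.

```python
from collections import defaultdict

def points_for(action: str) -> int:
    """Infer the points scored by one play-by-play action (hypothetical keyword rule)."""
    text = action.lower()
    if "makes" not in text:
        return 0  # misses, rebounds, fouls, and other non-scoring actions
    if "free throw" in text:
        return 1
    if "three point" in text:
        return 3
    return 2  # any other made field goal

def aggregate(plays):
    """Attribute inferred points to teams and players, then compile totals."""
    team_totals = defaultdict(int)
    player_totals = defaultdict(int)
    for team, player, action in plays:
        pts = points_for(action)
        team_totals[team] += pts
        player_totals[player] += pts
    return dict(team_totals), dict(player_totals)

# Made-up plays with placeholder names; not taken from any real game.
plays = [
    ("Team A", "Player 1", "makes 2-foot layup"),
    ("Team B", "Player 2", "misses 18-foot jumper"),
    ("Team B", "Player 2", "makes three point jumper"),
    ("Team A", "Player 1", "makes free throw 1 of 2"),
    ("Team A", "Player 1", "misses free throw 2 of 2"),
]

team_totals, player_totals = aggregate(plays)
print(team_totals)
print(player_totals)
```

A deterministic tally like this is trivial for a program once the actions are structured; the paper's question is whether an LLM can reach the same totals when it has to read them out of free-form narrative text.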

Paper Structure (24 sections, 1 equation, 11 figures, 4 tables)

Figures (11)

  • Figure 1: ESPN's NBA game play-by-play descriptions. We are particularly interested in exploring whether LLMs can perform analytical reasoning in a more focused and manageable context using divide-and-conquer strategies.
  • Figure 2: SportsGen, a new method that synthesizes sports narratives by modeling game dynamics.
  • Figure 3: (Left) Synthesized narratives with varying S:NS ratios. (Middle) Narratives grouped by # of scoring actions. (Right) Narratives grouped by # of tokens.
  • Figure 4: We examine four scenarios that increasingly alter the context to a more 'symbolic' format and evaluate the models' ability to calculate team points.
  • Figure 5: An example action graph created from NBA narratives. When a team has the ball, they are on offense and use actions such as passing, dribbling, shooting to score points. The defense tries to stop them by blocking, stealing, and rebounding. We refer to each sequence of these actions as a turn, and characterize it using a Markov graph. In the graph, each significant action is represented as a node, and transitions between nodes show how team members cooperate and execute their tactics. The node 'vs' denotes a matchup between two players, such as 'Nikola Jokic vs. Anthony Davis,' indicating their direct competition in key moments of the game. The graph begins and ends with special nodes that mark the start and end of a turn.
  • ...and 6 more figures
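
The Figure 5 caption describes each turn as a walk through a Markov graph of actions that begins and ends at special nodes. A minimal sketch of sampling one turn from such a graph follows. The node names, the transition probabilities, and the `sample_turn` helper are hypothetical values chosen for illustration; they are not taken from SportsGen or estimated from NBA data.

```python
import random

START, END = "<start>", "<end>"

# Hypothetical action graph: each node maps to its successors and their
# transition probabilities. The numbers are illustrative only.
TRANSITIONS = {
    START: {"pass": 0.6, "dribble": 0.4},
    "pass": {"shoot": 0.5, "dribble": 0.3, "steal": 0.2},
    "dribble": {"shoot": 0.6, "pass": 0.3, "steal": 0.1},
    "shoot": {"rebound": 0.4, END: 0.6},
    "rebound": {"pass": 0.5, END: 0.5},
    "steal": {END: 1.0},
}

def sample_turn(rng: random.Random, max_steps: int = 20) -> list[str]:
    """Walk the graph from START until END is reached or max_steps is hit."""
    node = START
    turn = []
    for _ in range(max_steps):
        successors = TRANSITIONS[node]
        node = rng.choices(
            list(successors), weights=list(successors.values()), k=1
        )[0]
        if node == END:
            break
        turn.append(node)
    return turn

rng = random.Random(0)  # fixed seed so the sampled turn is reproducible
print(sample_turn(rng))
```

In a sketch like this, making scoring actions more or less likely in the transition table would be one way to vary how densely scoring events appear in a synthesized narrative; how SportsGen actually controls this is not specified in the text above.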