Sporthesia: Augmenting Sports Videos Using Natural Language

Chen Zhu-Tian, Qisen Yang, Xiao Xie, Johanna Beyer, Haijun Xia, Yingcai Wu, Hanspeter Pfister

TL;DR

This work tackles the challenge of turning natural-language sports insights into embedded visualizations within videos. It introduces Sporthesia, a three-step pipeline that detects visualizable text entities, maps them to visualizations, and schedules them to play with the video. Grounded in a formative study of 155 clips across six sports, the authors implement a proof-of-concept system with three components (Entity Detector, Entity Visualizer, Visualization Scheduler) and demonstrate two applications: authoring augmented videos from text and augmenting archived videos via audio-to-text cues. Technical evaluation yields an F1 score around 0.9 for entity detection, while expert feedback from eight sports analysts indicates high utility, effectiveness, and satisfaction, highlighting practical benefits and areas for refinement and generalization to broader sports domains.

Abstract

Augmented sports videos, which combine visualizations and video effects to present data in actual scenes, can communicate insights engagingly and have thus become increasingly popular among sports enthusiasts around the world. Yet creating augmented sports videos remains challenging, requiring considerable time and video editing skills. Meanwhile, sports insights are often communicated in natural language, such as in commentaries, oral presentations, and articles, but these usually lack visual cues. This work therefore aims to facilitate the creation of augmented sports videos by enabling analysts to create visualizations embedded in videos directly from insights expressed in natural language. To achieve this goal, we propose a three-step approach: 1) detecting visualizable entities in the text, 2) mapping these entities into visualizations, and 3) scheduling these visualizations to play with the video. To inform these steps, we analyzed 155 sports video clips and their accompanying commentaries. Informed by our analysis, we designed and implemented Sporthesia, a proof-of-concept system that takes racket-based sports videos and textual commentaries as input and outputs augmented videos. We demonstrate Sporthesia's applicability in two exemplar scenarios: authoring augmented sports videos using text and augmenting historical sports videos based on auditory comments. A technical evaluation shows that Sporthesia achieves high accuracy (F1-score of 0.9) in detecting visualizable entities in the text. An expert evaluation with eight sports analysts suggests high utility, effectiveness, and satisfaction with our language-driven authoring method and provides insights for future improvements and opportunities.
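For readers unfamiliar with the metric, the F1-score reported for entity detection is the harmonic mean of precision and recall. The sketch below is only an illustration of that standard definition, not the authors' evaluation code: the function name `entity_f1`, the exact-match criterion on (text, label) pairs, and the example labels are all assumptions made here.

```python
def entity_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of (text, label) entities.

    Assumption: an entity counts as correct only on an exact (text, label) match.
    The matching rule used in the paper's evaluation is not stated in this summary.
    """
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels, for illustration only.
gold = [("player", "object"), ("crosscourt", "action"),
        ("backhand", "action"), ("baseline", "location")]
predicted = [("player", "object"), ("crosscourt", "action"),
             ("backhand", "action"), ("net", "location")]

precision, recall, f1 = entity_f1(predicted, gold)
```

Here three of the four predicted entities match the gold set, so precision, recall, and F1 all equal 0.75.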

Paper Structure (23 sections, 8 figures, 1 table)

Figures (8)

  • Figure 1: A three-step approach to augment sports videos with embedded visualizations based on text commentary. The three steps include detecting visualizable entities in the text, mapping them to visualizations, and scheduling the visualizations in the video.
  • Figure 2: The average a) duration of videos and b) number of words of commentaries per sport in the collected dataset. c) The number of entities per category in different sports.
  • Figure 3: Sporthesia detects the visualizable entities in the text (a1) and groups them into semantic units (a2). Next, the entities are mapped to visualizations (b1) with arguments specified by the semantic units (b2). Finally, the system initializes and calibrates the schedules of the visualizations based on the reading time of the text and the video events (c). All three steps are built upon the video processing components.
  • Figure 4: a) The visualization of hit is manually specified, which takes two arguments, i.e., from and to. b) The visualization of crosscourt can be generated based on its text explanation in the tennis glossary, which is a variant of hit with a default argument, diagonal court.
  • Figure 5: Left: The text is converted into audio that initializes the appearance time of each visualization. This initialized schedule can be used to render the visualizations in analyst mode. Right: When rendering in play-by-play mode, the appearance times of some visualizations are further calibrated based on video events.
  • ...and 3 more figures
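
The ideas in the captions of Figures 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Sporthesia's implementation: `VisSpec`, `variant`, `initial_schedule`, and `calibrate` are hypothetical names; `source` and `target` stand in for the `from` and `to` arguments (`from` is a reserved word in Python); and a fixed speaking rate replaces the text-to-audio conversion that the system uses to initialize appearance times.

```python
from dataclasses import dataclass, field


@dataclass
class VisSpec:
    """A visualization template with named arguments (hypothetical structure)."""
    name: str
    params: tuple
    defaults: dict = field(default_factory=dict)

    def bind(self, **args):
        """Merge defaults with supplied arguments; fail if any argument is missing."""
        bound = {**self.defaults, **args}
        missing = [p for p in self.params if p not in bound]
        if missing:
            raise ValueError(f"{self.name}: missing arguments {missing}")
        return bound


def variant(base, name, **defaults):
    """Derive a new spec from `base` with some arguments pre-filled."""
    return VisSpec(name, base.params, {**base.defaults, **defaults})


# Figure 4: `hit` takes two arguments; `crosscourt` is a variant of `hit`
# whose target defaults to the diagonal court.
HIT = VisSpec("hit", ("source", "target"))
CROSSCOURT = variant(HIT, "crosscourt", target="diagonal court")


def initial_schedule(text, entity_words, seconds_per_word=0.5):
    """Analyst mode: an entity appears when its word would be spoken.

    Assumption: a constant speaking rate approximates the audio timing.
    """
    schedule = {}
    for index, word in enumerate(text.lower().split()):
        token = word.strip(".,!?")
        if token in entity_words and token not in schedule:
            schedule[token] = index * seconds_per_word
    return schedule


def calibrate(schedule, video_events):
    """Play-by-play mode: use the video event time where one is known."""
    return {name: video_events.get(name, t) for name, t in schedule.items()}


sched = initial_schedule("The player hits a crosscourt winner.",
                         {"player", "crosscourt"})
calibrated = calibrate(sched, {"crosscourt": 2.25})
```

With these inputs, `sched` places `player` at 0.5 s and `crosscourt` at 2.0 s, and calibration moves `crosscourt` to the hypothetical video event time of 2.25 s while leaving `player` unchanged.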