COACH: Collaborative Agents for Contextual Highlighting -- A Multi-Agent Framework for Sports Video Analysis

Tsz-To Wong, Ching-Chun Huang, Hong-Han Shuai

TL;DR

COACH introduces a reconfigurable Multi-Agent System with a shared backbone to tackle multi-scale temporal reasoning in sports video analysis. By separating concerns into Orchestrator, Grounder, and Critic and guiding their collaboration with pre-defined SOPs and Structured CoT templates, COACH achieves superior temporal grounding and factual reasoning on badminton tasks compared with generalist video-language models. The framework is trained end-to-end on a badminton-focused COACH-Dataset and demonstrates notable gains in both fine-grained video QA and long-term summarization, while maintaining interpretability through its agent-specific reasoning traces. These results suggest that a modular, role-specialized, and policy-guided approach can generalize across tasks and scales, offering a scalable path toward robust, cross-task sports video intelligence.

Abstract

Intelligent sports video analysis demands a comprehensive understanding of temporal context, from micro-level actions to macro-level game strategies. Existing end-to-end models often struggle with this temporal hierarchy, offering solutions that lack generalization, incur high development costs for new tasks, and suffer from poor interpretability. To overcome these limitations, we propose a reconfigurable Multi-Agent System (MAS) as a foundational framework for sports video understanding. In our system, each agent functions as a distinct "cognitive tool" specializing in a specific aspect of analysis. The system's architecture is not confined to a single temporal dimension or task. By leveraging iterative invocation and flexible composition of these agents, our framework can construct adaptive pipelines for both short-term analytic reasoning (e.g., Rally QA) and long-term generative summarization (e.g., match summaries). We demonstrate the adaptability of this framework using two representative tasks in badminton analysis, showcasing its ability to bridge fine-grained event detection and global semantic organization. This work presents a paradigm shift towards a flexible, scalable, and interpretable system for robust, cross-task sports video intelligence. The project homepage is available at https://aiden1020.github.io/COACH-project-page

Paper Structure

This paper contains 38 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: A high-level conceptual diagram illustrating the collaborative interaction between agents.
  • Figure 2: Shared components library.
  • Figure 3: The multi-agent collaboration workflow in COACH. Based on the user's prompt, the Orchestrator Agent acts as an intent router and initiates one of two distinct collaboration policies. (Left) The Analytical Rally QA pipeline uses an Orchestrator-Critic loop to verify evidence and generate a factual text answer. (Right) The Generative Video Summarization pipeline uses a specialist Grounder-Critic loop to perform high-precision temporal localization, followed by the Media Composition Tool, which assembles the final video output.
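
The routing described in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`route_intent`, `refine_with_critic`, `run_coach_sketch`, the keyword list, the `max_rounds` limit) is a hypothetical stand-in, the agents are passed in as plain callables, and a keyword match replaces whatever intent classification the Orchestrator actually performs.

```python
from typing import Any, Callable, Optional, Tuple

# Hypothetical critic signature: returns (accepted, feedback_text).
Critique = Callable[[Any], Tuple[bool, Optional[str]]]


def route_intent(prompt: str) -> str:
    """Toy intent router: a keyword match stands in for the Orchestrator's decision."""
    keywords = ("summary", "summarize", "highlight")  # assumed trigger words
    if any(k in prompt.lower() for k in keywords):
        return "summarization"
    return "rally_qa"


def refine_with_critic(propose: Callable[[Optional[str]], Any],
                       critique: Critique,
                       max_rounds: int = 3) -> Any:
    """Generic propose/critique loop.

    Returns the first draft the critic accepts, or the last draft
    once max_rounds (an assumed budget) is exhausted.
    """
    feedback: Optional[str] = None
    draft: Any = None
    for _ in range(max_rounds):
        draft = propose(feedback)
        accepted, feedback = critique(draft)
        if accepted:
            break
    return draft


def run_coach_sketch(prompt: str, orchestrator, grounder, critic, compose) -> dict:
    """Dispatch to one of the two collaboration policies named in the caption."""
    if route_intent(prompt) == "rally_qa":
        # Left branch: Orchestrator-Critic loop producing a text answer.
        answer = refine_with_critic(lambda fb: orchestrator(prompt, fb), critic)
        return {"pipeline": "rally_qa", "output": answer}
    # Right branch: Grounder-Critic loop localizes segments,
    # then a media composition step assembles the output.
    segments = refine_with_critic(lambda fb: grounder(prompt, fb), critic)
    return {"pipeline": "summarization", "output": compose(segments)}
```

Passing the agents in as callables keeps the sketch independent of any particular model backend; in a system with a shared backbone, each callable would presumably wrap the same model under a different role prompt.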