COACH: Collaborative Agents for Contextual Highlighting -- A Multi-Agent Framework for Sports Video Analysis
Tsz-To Wong, Ching-Chun Huang, Hong-Han Shuai
TL;DR
COACH introduces a reconfigurable Multi-Agent System (MAS), built on a shared backbone, for multi-scale temporal reasoning in sports video analysis. It separates concerns into Orchestrator, Grounder, and Critic agents and guides their collaboration with predefined standard operating procedures (SOPs) and structured chain-of-thought (CoT) templates. On badminton tasks, COACH outperforms generalist video-language models in temporal grounding and factual reasoning. The framework is trained end-to-end on the badminton-focused COACH-Dataset and shows notable gains in both fine-grained video QA and long-term summarization, while remaining interpretable through its agent-specific reasoning traces. These results suggest that a modular, role-specialized, and policy-guided approach can generalize across tasks and temporal scales, offering a scalable path toward robust, cross-task sports video intelligence.
Abstract
Intelligent sports video analysis demands a comprehensive understanding of temporal context, from micro-level actions to macro-level game strategies. Existing end-to-end models often struggle with this temporal hierarchy, offering solutions that lack generalization, incur high development costs for new tasks, and suffer from poor interpretability. To overcome these limitations, we propose a reconfigurable Multi-Agent System (MAS) as a foundational framework for sports video understanding. In our system, each agent functions as a distinct "cognitive tool" specializing in a specific aspect of analysis. The system's architecture is not confined to a single temporal dimension or task. By leveraging iterative invocation and flexible composition of these agents, our framework can construct adaptive pipelines for both short-term analytic reasoning (e.g., Rally QA) and long-term generative summarization (e.g., match summaries). We demonstrate the adaptability of this framework using two representative tasks in badminton analysis, showcasing its ability to bridge fine-grained event detection and global semantic organization. This work presents a paradigm shift towards a flexible, scalable, and interpretable system for robust, cross-task sports video intelligence. The project homepage is available at https://aiden1020.github.io/COACH-project-page
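To make the idea of "iterative invocation and flexible composition" concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a shared state object passed between agents, a per-task SOP expressed as an ordered list of agent names, and a toy Critic that approves once enough segments are grounded. All names here (`State`, `SOPS`, `orchestrate`, `make_critic`, the task keys, the 10-second windows) are hypothetical and chosen only for illustration; a real Grounder and Critic would call the shared video-language backbone.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class State:
    """Shared scratchpad passed between agents (hypothetical structure)."""
    task: str
    segments: List[Tuple[float, float]] = field(default_factory=list)
    approved: bool = False
    trace: List[str] = field(default_factory=list)


Agent = Callable[[State], State]


def grounder(state: State) -> State:
    # Stand-in for temporal grounding: propose the next 10-second window.
    start = 10.0 * len(state.segments)
    state.segments.append((start, start + 10.0))
    state.trace.append(f"grounder: proposed {state.segments[-1]}")
    return state


def make_critic(min_segments: int) -> Agent:
    # Toy acceptance rule: approve once enough segments have been collected.
    def critic(state: State) -> State:
        state.approved = len(state.segments) >= min_segments
        state.trace.append(f"critic: approved={state.approved}")
        return state
    return critic


# Hypothetical SOPs: an ordered list of agent names for each task.
SOPS: Dict[str, List[str]] = {
    "rally_qa": ["grounder", "critic"],
    "match_summary": ["grounder", "grounder", "critic"],
}


def orchestrate(task: str, agents: Dict[str, Agent], max_rounds: int = 5) -> State:
    """Run the task's SOP repeatedly until the critic approves or rounds run out."""
    state = State(task=task)
    for _ in range(max_rounds):
        for name in SOPS[task]:
            state = agents[name](state)
        if state.approved:
            break
    return state
```

In this sketch the same two agents serve both tasks; only the SOP entry changes, which is one way to read "reconfigurable". The `trace` list records each agent's step, loosely mirroring the agent-specific reasoning traces mentioned in the TL;DR.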
