X-VARS: Introducing Explainability in Football Refereeing with Multi-Modal Large Language Model

Jan Held, Hani Itani, Anthony Cioppa, Silvio Giancola, Bernard Ghanem, Marc Van Droogenbroeck

TL;DR

The paper presents X-VARS, a multi-modal large language model designed to explain football refereeing decisions, built atop a fine-tuned CLIP encoder and a Video-ChatGPT-based LLM. It introduces SoccerNet-XFoul, a large dataset of over 10k video clips and 22k referee-annotated video-question-answer triplets with detailed explanations, enabling training and evaluation of explainable refereeing reasoning. Through a two-stage training paradigm, X-VARS achieves state-of-the-art performance on foul detection and severity classification, and a dedicated human study finds its explanation quality comparable to that of human referees. The study also shows that video tokens and ground-truth supervision are important for robust explanations. The work highlights the potential of explainable AI to support referees, increase transparency, and foster trust in automated sports analytics, with practical implications for future referee-assistance tools.

Abstract

The rapid advancement of artificial intelligence has led to significant improvements in automated decision-making. However, the increased performance of models often comes at the cost of explainability and transparency of their decision-making processes. In this paper, we investigate the capabilities of large language models to explain decisions, using football refereeing as a testing ground, given its decision complexity and subjectivity. We introduce the Explainable Video Assistant Referee System, X-VARS, a multi-modal large language model designed for understanding football videos from the point of view of a referee. X-VARS can perform a multitude of tasks, including video description, question answering, action recognition, and conducting meaningful conversations based on video content and in accordance with the Laws of the Game for football referees. We validate X-VARS on our novel dataset, SoccerNet-XFoul, which consists of more than 22k video-question-answer triplets annotated by over 70 experienced football referees. Our experiments and human study illustrate the impressive capabilities of X-VARS in interpreting complex football clips. Furthermore, we highlight the potential of X-VARS to reach human performance and support football referees in the future.

Paper Structure

This paper contains 12 sections, 6 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: SoccerNet-XFoul dataset. Examples of annotations from two different referees for the same foul. The second example illustrates the complexity and subjectivity of refereeing decisions.
  • Figure 2: Distribution of the most common words. The most frequent words are "foul" and "defender," followed by words semantically related to football and to refereeing actions and terminology. The word distribution is therefore significantly imbalanced.
  • Figure 3: Architecture of X-VARS. X-VARS is a visual language model based on a fine-tuned CLIP visual encoder that extracts spatio-temporal video features and produces multi-task predictions of the type and severity of fouls. A linear layer connects the vision encoder to the language model by projecting the video features into the text embedding dimension. The projected spatio-temporal features are input, alongside the text predictions obtained from the two classification heads $\mathbf{C_{foul}}$ and $\mathbf{C_{sev}}$ (which determine, respectively, the type of foul and whether it is a foul together with its severity), into the Vicuna-v1.1 model, initialized with weights from LLaVA.
  • Figure 4: Qualitative results. Although X-VARS has never been specifically fine-tuned for conversation, it has inherited its conversational capabilities from the pre-trained model. X-VARS demonstrates impressive discussion skills while being aligned with the video content and the Laws of the Game. (a) X-VARS is close to the ground truth and is able to accurately answer the user's question. (b) This example shows the subjectivity of foul situations. X-VARS interprets the foul as medium intensity, while the human referee interprets it as low intensity with no chance to play the ball.
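
The data flow described in the Figure 3 caption can be sketched in a few lines of NumPy. This is only an illustrative toy under stated assumptions, not the authors' implementation: the feature sizes, the label lists, the prompt template, and the helpers `classify`, `build_prompt`, and `embed_text` are all hypothetical, and random matrices stand in for the fine-tuned CLIP encoder, the trained heads, and the language model's embedding table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
NUM_VIDEO_TOKENS = 8   # spatio-temporal tokens from the vision encoder
VISION_DIM = 16        # width of the vision encoder features
TEXT_DIM = 32          # width of the language model's token embeddings

# Illustrative label sets; not claimed to match the dataset's exact classes.
FOUL_TYPES = ["tackling", "pushing", "holding", "elbowing"]
SEVERITIES = ["no offence", "offence, no card", "offence, yellow card", "offence, red card"]


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def classify(features, weights, labels):
    """Mean-pool the video tokens, apply a linear head, return (top label, probabilities)."""
    pooled = features.mean(axis=0)
    probs = softmax(pooled @ weights)
    return labels[int(np.argmax(probs))], probs


def build_prompt(foul_type, severity, question):
    """Hypothetical template that turns the two head predictions into text for the LLM."""
    return f"Predicted foul type: {foul_type}. Predicted severity: {severity}. {question}"


def embed_text(prompt, dim, rng):
    """Toy stand-in for a tokenizer plus embedding table: one random vector per word."""
    return rng.normal(size=(len(prompt.split()), dim))


# Stand-in for the output of the fine-tuned visual encoder.
video_features = rng.normal(size=(NUM_VIDEO_TOKENS, VISION_DIM))

# Two classification heads (C_foul and C_sev in the caption), here with random weights.
w_foul = rng.normal(size=(VISION_DIM, len(FOUL_TYPES)))
w_sev = rng.normal(size=(VISION_DIM, len(SEVERITIES)))
foul_type, foul_probs = classify(video_features, w_foul, FOUL_TYPES)
severity, sev_probs = classify(video_features, w_sev, SEVERITIES)

# Linear layer projecting video features into the text embedding dimension.
w_proj = rng.normal(size=(VISION_DIM, TEXT_DIM))
b_proj = np.zeros(TEXT_DIM)
projected = video_features @ w_proj + b_proj

# The language model receives the projected video tokens alongside the text predictions.
prompt = build_prompt(foul_type, severity, "Is it a foul, and why?")
text_embeddings = embed_text(prompt, TEXT_DIM, rng)
llm_input = np.concatenate([projected, text_embeddings], axis=0)

print(prompt)
print(llm_input.shape)
```

The point of the sketch is the shape bookkeeping: the projection makes each video token the same width as a text token, so both can be placed in one input sequence, while the head predictions reach the language model as ordinary text.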