Towards AI-Powered Video Assistant Referee System (VARS) for Association Football

Jan Held, Anthony Cioppa, Silvio Giancola, Abdullah Hamdi, Christel Devue, Bernard Ghanem, Marc Van Droogenbroeck

TL;DR

This work introduces VARS, a semi-automated, multi-view video analysis system to assist football referees by flagging probable errors without replacing human judgment. Leveraging an attention-based fusion over multiple camera views and a pre-trained MViT encoder, VARS jointly predicts the foul type and offense severity, trained end-to-end on SoccerNet-MVFoul data. The results show state-of-the-art performance on this dataset, with notable improvements over pooling baselines and a compelling speed advantage, though human performance remains higher in accuracy. A comprehensive human study reveals the subjective nature of refereeing decisions and highlights VARS' potential as a fast, scalable decision-support tool for leagues with limited resources.

Abstract

Over the past decade, the technology used by referees in football has improved substantially, enhancing the fairness and accuracy of decisions. This progress has culminated in the implementation of the Video Assistant Referee (VAR), an innovation that enables backstage referees to review incidents on the pitch from multiple points of view. However, the VAR is currently limited to professional leagues due to its expensive infrastructure and the shortage of referees worldwide. In this paper, we present the semi-automated Video Assistant Referee System (VARS), which leverages the latest findings in multi-view video analysis. VARS sets a new state-of-the-art on the SoccerNet-MVFoul dataset, a multi-view video dataset of football fouls, by recognizing the type of foul in 50% of instances and the appropriate sanction in 46% of cases. Finally, we conducted a comparative study to investigate human performance in classifying fouls and their corresponding severity, and compared these findings to our VARS. The results of our study highlight the potential of our VARS to reach human performance and support football refereeing across all levels of professional and amateur federations.

Paper Structure (11 sections, 7 equations, 5 figures, 3 tables)

This paper contains 11 sections, 7 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Architecture of our Video Assistant Referee System. Taking multi-view video clips as input, our system encodes per-view video features ($\mathbf{E}$), aggregates the view features ($\mathbf{A}$), and classifies different properties ($\mathbf{C_{Foul}}$ and $\mathbf{C_{Off}}$). The figure is inspired by Held2023VARS.
  • Figure 2: Architecture of the attention block. "MatMul" represents matrix multiplication, "T" denotes transpose, "Norm" signifies normalization, and "SumRow" indicates the process of summing each row.
  • Figure 3: Performance evaluation for different dataset sizes. 100% of the dataset corresponds to $2{,}319$ actions. For each dataset size, we independently trained and tested the model $10$ times. All tests were performed on the same test set. The error bars correspond to the standard deviation. For 0% of the dataset, we report the accuracy of a random decision.
  • Figure 4: Qualitative results. VARS predictions on two examples, where the attention score of each view is given as a percentage. The ground truth is shown in bold and the model prediction, with its confidence, in italic.
  • Figure 5: Example of the subjectivity of human choices. Decisions taken by our participants: "No offense", "Offense + No card", and "Offense + Yellow card".
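
The captions of Figures 1 and 2 name the steps of the view aggregation: matrix multiplication, transpose, normalization, and row summation. The sketch below is one possible reading of those steps in plain Python, not the authors' implementation. The function name `attention_aggregate`, the `weight` matrix standing in for a learned projection, and the choice of a softmax over the whole similarity matrix as the "Norm" step are all assumptions made for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def attention_aggregate(view_features, weight):
    """Fuse per-view feature vectors into a single vector.

    view_features: one feature vector per camera view (n_views x d).
    weight: d x d matrix, a hypothetical stand-in for a learned projection.
    Returns (fused_vector, per_view_attention_scores).
    """
    projected = matmul(view_features, weight)             # "MatMul"
    similarity = matmul(projected, transpose(projected))  # "MatMul" with "T"

    # "Norm" (assumed): softmax over every entry of the similarity matrix,
    # shifted by the maximum for numerical stability.
    peak = max(max(row) for row in similarity)
    exps = [[math.exp(v - peak) for v in row] for row in similarity]
    total = sum(sum(row) for row in exps)

    # "SumRow": one score per view; the scores sum to 1 by construction.
    scores = [sum(row) / total for row in exps]

    # Weighted sum of the original view features using the scores.
    dim = len(view_features[0])
    fused = [sum(s * view[j] for s, view in zip(scores, view_features))
             for j in range(dim)]
    return fused, scores


if __name__ == "__main__":
    views = [[1.0, 0.0], [0.0, 2.0]]      # two toy "views", d = 2
    identity = [[1.0, 0.0], [0.0, 1.0]]   # placeholder projection
    fused, scores = attention_aggregate(views, identity)
    print(scores, fused)
```

Under these assumptions, two identical views receive equal scores, and a view whose projected features have a larger norm receives a larger share of the attention, in the spirit of the per-view percentages that Figure 4 reports.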