UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark

Hasnat Md Abdullah, Tian Liu, Kangda Wei, Shu Kong, Ruihong Huang

TL;DR

This work introduces UAL-Bench, the first comprehensive benchmark for unusual activity localization in videos, comprising three datasets (UAG-OOPS, UAG-SSBD, UAG-FunQA) and an instruction-tuning dataset (OOPS-UAG-Instruct). It evaluates three modeling paradigms: Vid-LLMs, instruction-tuned Vid-LLMs, and a two-step VLM-LLM integration in which an LLM performs the temporal localization. A new metric, $R@1, TD \leq p$, is proposed to address edge cases in evaluation. Empirically, the VLM-LLM approach excels at localizing extremely short-span events and predicting their onset, while long-duration videos, especially autism-related content, pose significant challenges; instruction tuning without time-aware encodings underperforms. The results show the strengths and limits of foundation models for temporal localization and point to time-aware encodings and broader domains as directions for more robust unusual-activity localization in practical settings.

Abstract

Localizing unusual activities, such as human errors or surveillance incidents, in videos holds practical significance. However, current video understanding models struggle with localizing these unusual events, likely because of their insufficient representation in models' pretraining datasets. To explore foundation models' capability in localizing unusual activity, we introduce UAL-Bench, a comprehensive benchmark for unusual activity localization, featuring three video datasets (UAG-OOPS, UAG-SSBD, UAG-FunQA) and an instruction-tuning dataset (OOPS-UAG-Instruct) to improve model capabilities. UAL-Bench evaluates three approaches: Video-Language Models (Vid-LLMs), instruction-tuned Vid-LLMs, and a novel integration of Vision-Language Models and Large Language Models (VLM-LLM). Our results show the VLM-LLM approach excels in localizing short-span unusual events and predicts their onset (start time) more accurately than Vid-LLMs. We also propose a new metric, $R@1, TD \leq p$, to address limitations in existing evaluation methods. Our findings highlight the challenges posed by long-duration videos, particularly in autism diagnosis scenarios, and the need for further advancements in localization techniques. Our work not only provides a benchmark for unusual activity localization but also outlines the key challenges for existing foundation models, suggesting future research directions on this important task.
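
To make the motivation for a distance-based recall concrete, the sketch below contrasts it with the common R@1, IoU >= m recall on two toy predictions. The abstract does not define Temporal Distance, so the definition used here (absolute offset between predicted and ground-truth start times, in seconds) is only an assumption chosen because the abstract emphasizes onset prediction; the function names, the thresholds, and the toy spans are likewise hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Standard temporal intersection-over-union of two time spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_distance(pred: Span, gt: Span) -> float:
    """ASSUMED definition: absolute start-time (onset) offset in seconds.

    The paper's actual Temporal Distance may differ; swap this out as needed.
    """
    return abs(pred[0] - gt[0])


def recall_at_1(preds: List[Span], gts: List[Span],
                is_hit: Callable[[Span, Span], bool]) -> float:
    """Fraction of samples whose single (top-1) prediction counts as a hit."""
    hits = sum(1 for p, g in zip(preds, gts) if is_hit(p, g))
    return hits / len(gts)


# Toy data (hypothetical): the first event lasts only 0.5 s, and the
# prediction misses it by 0.6 s, so the two spans do not overlap at all.
gts = [(5.0, 5.5), (10.5, 13.5)]
preds = [(5.6, 6.1), (10.0, 14.0)]

r_iou = recall_at_1(preds, gts, lambda p, g: temporal_iou(p, g) >= 0.5)
r_td = recall_at_1(preds, gts, lambda p, g: temporal_distance(p, g) <= 1.0)

print(f"R@1, IoU >= 0.5: {r_iou:.2f}")
print(f"R@1, TD <= 1.0 s: {r_td:.2f}")
```

Under these assumptions the short-span prediction scores zero IoU despite being less than a second off, while a distance threshold still credits it. This is one plausible reading of the edge case the metric is meant to address.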


Paper Structure

This paper contains 16 sections, 8 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Example of an unusual activity in a baseball game scene. In the 3rd and 4th frames, the ball unexpectedly strikes a batter's head, causing him to fall on the ground. This event is classified as an unusual action.
  • Figure 2: An illustration of our proposed Temporal Distance.
  • Figure 3: Comparison of explanations among the best-performing models from our experiments. The VLM-LLM approach demonstrates a superior understanding of the scene compared to other models. Explanations highlighted in red are incorrect, while those in green are correct.