BEAR: A Video Dataset For Fine-grained Behaviors Recognition Oriented with Action and Environment Factors

Chengyang Hu, Yuduo Chen, Lizhuang Ma

TL;DR

BEAR tackles fine-grained video behavior recognition by decoupling the environment and action factors. It introduces two protocol families, FG-BSE (similar environments) and FG-BSA (similar actions), each with sub-protocols that control which factor is held similar. The dataset comprises diverse, well-controlled scenarios with a broad train/test split and in-the-wild variations, enabling rigorous multi-modal benchmarking. A comprehensive empirical study evaluates the RGB, optical flow, skeleton, and text modalities across multiple models, including zero-shot evaluation of the text modality with VideoCLIP. The study reveals distinct roles: RGB primarily captures environment cues, optical flow emphasizes action and its environmental context, skeleton focuses on action, and text remains data-limited but becomes environment-aware with leveled prompts. These findings offer practical guidance for feature learning and benchmark design in video understanding, and they motivate further dataset and modality research in fine-grained behavior recognition.

Abstract

Behavior recognition is an important task in video representation learning, and effective feature learning is essential to it. Recently, researchers have started to study fine-grained behavior recognition, which presents similar behaviors and encourages models to attend to finer details and learn features that can tell those behaviors apart. However, previous fine-grained behavior datasets controlled only part of the information to be similar, leading to an unfair and incomplete evaluation of existing works. In this work, we develop a new fine-grained video behavior dataset, named BEAR, which provides fine-grained (i.e., similar) behaviors that uniquely focus on the two primary factors defining a behavior: Environment and Action. It includes two fine-grained behavior protocols, Fine-grained Behavior with Similar Environments (FG-BSE) and Fine-grained Behavior with Similar Actions (FG-BSA), as well as multiple sub-protocols covering different scenarios. With this new dataset, we conduct multiple experiments with different behavior recognition models. Our research primarily explores the impact of input modality, a critical element in studying the environment-based and action-based aspects of behavior recognition. Our experimental results yield insights with substantial implications for further research.

Paper Structure

This paper contains 10 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The BEAR dataset for exploring the effectiveness of different modalities on two important factors of behavior recognition: environment and action. (a) The protocols of BEAR are built under well-controlled environment and action conditions. (b) Summary of the information that each modality learns for behavior recognition. ✓: The modality can learn this factor. ✘: The modality cannot learn this factor. $\triangle$: The modality can learn this factor under some conditions.
  • Figure 2: Protocols of FG-BSE and FG-BSA. (a) FG-BSE-AD protocol. (b) FG-BSE-EAG protocol. (c) FG-BSE-EAW protocol. The dotted block shows the categories used in this setting. (d) Relations between similar actions, taken from ConceptNet. The arrow $\rightarrow$ indicates a relation between two behaviors. All pairs in the FG-BSA setting have semantically similar actions.
  • Figure 3: Confusion matrices of (a) TSN with RGB input and (b) TSN with optical flow input. Each red block marks pairs of videos in the same environment. (c) Optical flow fields of videos in the class "Riding a bicycle". From top to bottom: a normal video, an anomaly video that TSN (flow) misclassifies, and an anomaly video that TSN (flow) classifies correctly. The numbers above are the frame indices of the videos.
  • Figure 4: Designs of the video-text multi-modal model. (a) Vanilla VideoCLIP. (b) VideoCLIP with leveled prompts. The model first classifies the video using environment prompts, then classifies it using the corresponding behavior prompts.
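
The leveled-prompt design described for Figure 4(b) can be illustrated with a minimal sketch. It assumes that video and prompt embeddings are already available as vectors and that matching is done by cosine similarity. The function names, the environment and behavior labels other than "riding a bicycle", and the hand-made toy embeddings are all hypothetical. This is not the authors' implementation and it does not call VideoCLIP.

```python
import numpy as np


def cosine_scores(video_emb, prompt_embs):
    """Cosine similarity between one video embedding and each prompt embedding."""
    v = video_emb / np.linalg.norm(video_emb)
    p = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    return p @ v


def leveled_classify(video_emb, env_names, env_embs, behaviors_by_env):
    """Two-stage zero-shot classification (hypothetical helper).

    Stage 1 picks the best-matching environment prompt.
    Stage 2 picks the best-matching behavior prompt among the
    behaviors listed for that environment only.
    """
    env_idx = int(np.argmax(cosine_scores(video_emb, env_embs)))
    env = env_names[env_idx]
    behavior_names, behavior_embs = behaviors_by_env[env]
    beh_idx = int(np.argmax(cosine_scores(video_emb, behavior_embs)))
    return env, behavior_names[beh_idx]


# Toy, hand-made embeddings (assumption: stand-ins for real encoder outputs).
env_names = ["kitchen", "street"]
env_embs = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0]])
behaviors_by_env = {
    "kitchen": (["washing dishes", "cutting vegetables"],
                np.array([[1.0, 0.0, 1.0, 0.0],
                          [1.0, 0.0, 0.0, 1.0]])),
    "street": (["riding a bicycle", "walking"],
               np.array([[0.0, 1.0, 1.0, 0.0],
                         [0.0, 1.0, 0.0, 1.0]])),
}

video_emb = np.array([0.1, 0.9, 0.8, 0.2])
result = leveled_classify(video_emb, env_names, env_embs, behaviors_by_env)
print(result)  # ('street', 'riding a bicycle')
```

Restricting the second stage to the behaviors of the predicted environment means an environment error in the first stage cannot be recovered later, which is the main trade-off of this two-level scheme compared with scoring all behavior prompts at once.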