Leveraging Transformers for Weakly Supervised Object Localization in Unconstrained Videos

Shakeeb Murtaza, Marco Pedersoli, Aydin Sarraf, Eric Granger

TL;DR

This work tackles weakly supervised video object localization (WSVOL) in unconstrained videos by introducing TrCAM-V, a transformer-based CAM framework built on a DeiT backbone with two heads: a classification head trained on video-level labels, and a localization head trained on pseudo-pixels sampled from pseudo-labels that a pre-trained CLIP model provides. The pseudo-labels are generated by applying GradCAM to CLIP with sharpness-based prompts. Foreground (FG) and background (BG) regions are separated by Otsu thresholding, and pseudo-pixels are then sampled stochastically from each region. A CRF loss aligns the localization map with object boundaries. The model is trained end-to-end without temporal supervision and, at inference, processes frames individually, which suits real-time localization. On the YouTube-Objects datasets, TrCAM-V is reported to achieve state-of-the-art localization and classification performance, underscoring the effectiveness of combining CLIP-derived pseudo-label supervision with a transformer-based CAM.

Abstract

Weakly-Supervised Video Object Localization (WSVOL) involves localizing an object in videos using only video-level labels, also referred to as tags. State-of-the-art WSVOL methods like Temporal CAM (TCAM) rely on class activation mapping (CAM) and typically require a pre-trained CNN classifier. However, their localization accuracy is affected by their tendency to minimize the mutual information between different instances of a class and to exploit temporal information during training for downstream tasks, e.g., detection and tracking. In the absence of bounding box annotations, it is challenging to extract precise information about objects from temporal cues because the model struggles to locate objects over time. To address these issues, a novel method called transformer-based CAM for videos (TrCAM-V) is proposed for WSVOL. It consists of a DeiT backbone with two heads, one for classification and one for localization. The classification head is trained using a standard classification loss (CL), while the localization head is trained using pseudo-labels that are extracted using a pre-trained CLIP model. In these pseudo-labels, regions with high and low activation values are treated as foreground and background, respectively. Our TrCAM-V method trains the localization network by sampling pseudo-pixels on the fly from these regions. Additionally, a conditional random field (CRF) loss is employed to align the foreground map with the object boundaries. During inference, the model can process individual frames for real-time localization applications. Extensive experiments on the challenging YouTube-Objects unconstrained video datasets show that our TrCAM-V method achieves new state-of-the-art performance in terms of classification and localization accuracy.
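
The foreground/background split and the on-the-fly pseudo-pixel sampling described above can be illustrated with a short sketch. It is not the authors' implementation: the function names `otsu_threshold` and `sample_pseudo_pixels`, the 256-bin histogram, uniform sampling within each region, and the toy activation map are all assumptions made for illustration. The only detail taken from the source is that regions are separated by Otsu thresholding and that pixels are drawn stochastically from each region.

```python
import random


def otsu_threshold(values, bins=256):
    """Otsu's method on activations in [0, 1]: pick the cut that
    maximizes the between-class variance of the histogram."""
    hist = [0] * bins
    for v in values:
        hist[min(int(v * bins), bins - 1)] += 1

    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_bin, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0.0
    for t in range(bins):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background mass yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # everything is already background
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_bin = var_between, t
    # Values at or above this cut fall in bins above best_bin.
    return (best_bin + 1) / bins


def sample_pseudo_pixels(cam, n_fg=1, n_bg=1, rng=None):
    """Split a 2-D activation map into FG/BG with Otsu, then draw
    pixel coordinates uniformly at random from each region
    (uniform sampling is an assumption of this sketch)."""
    rng = rng or random.Random()
    thr = otsu_threshold([v for row in cam for v in row])
    fg = [(r, c) for r, row in enumerate(cam)
          for c, v in enumerate(row) if v >= thr]
    bg = [(r, c) for r, row in enumerate(cam)
          for c, v in enumerate(row) if v < thr]
    fg_pick = rng.sample(fg, min(n_fg, len(fg)))
    bg_pick = rng.sample(bg, min(n_bg, len(bg)))
    return fg_pick, bg_pick, thr


# Toy activation map (hypothetical): a bright 2x2 object on a dim background.
toy_cam = [
    [0.05, 0.10, 0.10, 0.05],
    [0.10, 0.90, 0.85, 0.10],
    [0.10, 0.95, 0.80, 0.05],
    [0.05, 0.10, 0.10, 0.05],
]
fg_pixels, bg_pixels, threshold = sample_pseudo_pixels(
    toy_cam, n_fg=2, n_bg=3, rng=random.Random(0)
)
print(threshold, fg_pixels, bg_pixels)
```

Because a fresh set of pixels is drawn on every call, repeated calls over training steps expose the localization head to different parts of each region, which is one plausible reading of "sampling pseudo-pixels on the fly".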

Paper Structure (10 sections, 5 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Illustration of the proposed TrCAM-V training architecture. It consists of a DeiT backbone with a classification head and a localization head, which are trained using class labels and pseudo-labels, respectively. A pre-trained CLIP model is employed to generate pseudo-labels from a sharpness-based prompt together with the input image, as suggested in lin2023clip. These pseudo-labels are then used to sample pseudo-pixels for training the localization head. For inference, only the DeiT backbone and both heads are retained.
  • Figure 2: Visualization of YTOv1 frames. Red and green boxes indicate the predicted and annotated bounding boxes, respectively.
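
Figure 1 states that the localization head is trained on sampled pseudo-pixels rather than on a dense mask. One common way to express such supervision is a partial cross-entropy evaluated only at the sampled locations. The sketch below assumes that form. The function name `partial_cross_entropy`, the binary FG/BG formulation, and the toy prediction map are illustrative assumptions, not the loss defined in the paper, and the CRF term is omitted.

```python
import math


def partial_cross_entropy(fg_prob, fg_pixels, bg_pixels, eps=1e-7):
    """Mean binary cross-entropy over sampled pseudo-pixels only.

    fg_prob   : 2-D list of predicted foreground probabilities in [0, 1]
    fg_pixels : (row, col) locations pseudo-labelled as foreground
    bg_pixels : (row, col) locations pseudo-labelled as background
    Pixels that were not sampled contribute nothing to the loss.
    """
    losses = []
    for r, c in fg_pixels:
        p = min(max(fg_prob[r][c], eps), 1.0 - eps)  # clamp for log stability
        losses.append(-math.log(p))
    for r, c in bg_pixels:
        p = min(max(fg_prob[r][c], eps), 1.0 - eps)
        losses.append(-math.log(1.0 - p))
    return sum(losses) / len(losses) if losses else 0.0


# Hypothetical 2x2 prediction from the localization head.
toy_pred = [
    [0.9, 0.2],
    [0.5, 0.1],
]
# One sampled FG pixel at (0, 0) and one sampled BG pixel at (1, 1).
print(partial_cross_entropy(toy_pred, [(0, 0)], [(1, 1)]))
```

Restricting the loss to a handful of sampled pixels means that noisy regions of a pseudo-label influence a given update only if they happen to be drawn.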