CoLo-CAM: Class Activation Mapping for Object Co-Localization in Weakly-Labeled Unconstrained Videos

Soufiane Belharbi, Shakeeb Murtaza, Marco Pedersoli, Ismail Ben Ayed, Luke McCaffrey, Eric Granger

TL;DR

CoLo-CAM tackles weakly supervised video object localization in unconstrained videos by introducing a color-based co-localization objective that jointly trains CAMs across multiple frames without constraining object motion. Building on an F-CAM–style encoder–decoder architecture, it combines per-frame pseudo-labels, CRF-based local consistency, and an absolute size constraint with a novel multi-frame color CRF term that enforces consistent activations on similarly colored pixels across frames. The method achieves state-of-the-art CorLoc on the YouTube-Objects datasets, demonstrating robustness to long-term temporal dependencies and producing sharper, more complete object activations while maintaining practical inference speed. Limitations include frames that do not contain the target object and temporal instability at inference, which suggests incorporating frame-level presence detection and improved temporal inference strategies.

Abstract

Leveraging spatiotemporal information in videos is critical for weakly supervised video object localization (WSVOL) tasks. However, state-of-the-art methods rely only on visual and motion cues while discarding discriminative information, making them susceptible to inaccurate localizations. Recently, discriminative models have been explored for WSVOL tasks using a temporal class activation mapping (CAM) method. Although their results are promising, objects are assumed to have limited movement from frame to frame, leading to degradation in performance for relatively long-term dependencies. This paper proposes a novel CAM method for WSVOL that exploits spatiotemporal information in activation maps during training without constraining an object's position. Its training relies on co-localization, hence the name CoLo-CAM. Given a sequence of frames, localization is jointly learned based on color cues extracted across the corresponding maps, by assuming that an object has similar color in consecutive frames. CAM activations are constrained to respond similarly over pixels with similar colors, achieving co-localization. This improves localization performance because the joint learning creates direct communication among pixels across all image locations and over all frames, allowing for transfer, aggregation, and correction of localizations. Co-localization is integrated into training by minimizing the color term of a conditional random field (CRF) loss over a sequence of frames/CAMs. Extensive experiments on two challenging YouTube-Objects datasets of unconstrained videos show the merits of our method and its robustness to long-term dependencies, leading to new state-of-the-art performance for the WSVOL task.
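
To make the co-localization idea concrete, the following is a minimal sketch of a color-only pairwise term computed over pixels pooled from several frames. It assumes a relaxed Potts form with a Gaussian kernel on RGB distance, which is a common way to write the color term of a CRF loss; the exact formulation, kernel bandwidth, and normalization used in the paper may differ. The names `color_affinity`, `multi_frame_color_term`, and `sigma` are hypothetical and chosen for illustration only.

```python
import math


def color_affinity(c1, c2, sigma=0.1):
    """Gaussian kernel on squared RGB distance; colors are (r, g, b) in [0, 1]."""
    d2 = sum((a - b) ** 2 for a, b in zip(c1, c2))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def multi_frame_color_term(frames, cams, sigma=0.1):
    """Relaxed Potts energy over all pixel pairs pooled from all frames.

    frames: list of n frames, each a flat list of (r, g, b) pixels.
    cams:   list of n maps, each a flat list of foreground probabilities.
    Pixel position is ignored on purpose (assumption of this sketch): only
    color similarity links two pixels, whichever frame they come from.
    """
    pixels = [p for frame in frames for p in frame]
    fg = [s for cam in cams for s in cam]
    energy = 0.0
    for i in range(len(pixels)):
        for j in range(len(pixels)):
            if i == j:
                continue
            w = color_affinity(pixels[i], pixels[j], sigma)
            # Similar colors with disagreeing foreground/background
            # assignments are penalized in proportion to their affinity.
            energy += w * (fg[i] * (1.0 - fg[j]) + (1.0 - fg[i]) * fg[j])
    return energy


# Toy data: two frames of two pixels each. The brown "object" pixel and the
# green "background" pixel swap positions between the frames.
brown, green = (0.6, 0.4, 0.2), (0.1, 0.7, 0.1)
frames = [[brown, green], [green, brown]]
consistent = [[1.0, 0.0], [0.0, 1.0]]    # activation follows the brown pixel
inconsistent = [[1.0, 0.0], [1.0, 0.0]]  # activation stays at a fixed position

low = multi_frame_color_term(frames, consistent)
high = multi_frame_color_term(frames, inconsistent)
print(low < high)
```

In this toy case the term is close to zero when the activation follows the brown pixel across frames and much larger when the activation stays at the same position, which is the behavior the abstract describes: similar colors are pushed toward similar CAM responses regardless of where the object moves. The quadratic loop over all pixel pairs is for readability only; a practical implementation would need an efficient approximation.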

Paper Structure

This paper contains 17 sections, 7 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Illustration of the combined single-frame and multi-frame training process of our CoLo-CAM method in the case of $n=3$ frames. For the multi-frame loss, every pair of pixels across all three images is interconnected; for clarity, only a few pixels and connections are shown. Green dots are foreground pixels and blue dots are background pixels. Solid connections illustrate strong visual similarity, while dotted connections indicate weak similarity. The classifier (encoder $g$ + classification head) is pre-trained and frozen; only the decoder $f$ is trained. We employ per-frame and multi-frame losses. Our multi-frame loss ${\mathcal{R}_c}$ assumes that an object of interest has a similar color over multiple adjacent frames, i.e., the object 'horse' in the frames. This assumption also applies to background objects, i.e., the 'sky', 'grass', and 'trees'. CAMs are constrained to have similar responses over pixels with similar color across different locations over all frames. Additionally, the per-frame terms consist of the pseudo-label loss, the absolute size constraint (${\mathcal{R}_s}$), and the CRF loss (${\mathcal{R}}$).
  • Figure 2: Ablations on the YTOv1 test set. (a) Impact of the number of frames ${n}$ on CorLoc accuracy. (b) Impact of the adaptive ${\lambda_c}$ on CorLoc accuracy. (c) Left y-axis, orange: CorLoc accuracy on the YTOv1 test set using constant and adaptive ${\lambda_c}$ weights. Right y-axis, green: ${\log{( \left| \mathcal{R}_c \right|)}}$. The x-axis is the number of frames $n$ (better visualized in color). (d) Computation time of our multi-frame loss term (Eq. \ref{eq:crf_rgb}) as a function of the number of frames $n$.
  • Figure 3: Stability of CoLo-CAM localization performance (CorLoc) on the YTOv1 and YTOv2.2 test sets. Results are obtained by randomizing the effective training set: $m$ random shots are removed from the training set, and this process is repeated 30 times.
  • Figure 4: Typical challenges of our method. Columns 1-2: Dense and overlapping instances lead to fused localizations. Columns 3-4: Object mis-localization. The classifier CAM does not activate over the right object, and our decoder, trained using these CAMs, is unable to correct this large error. Bounding boxes: ground truth (green), prediction (red).
  • Figure 5: Localization examples on test-set frames. Bounding boxes: ground truth (green), prediction (red). The second column of each method shows the predicted CAM overlaid on the image.
  • ...and 3 more figures
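
The figures above report CorLoc accuracy and compare predicted (red) with ground-truth (green) bounding boxes. As a rough illustration of how such a score can be computed, the sketch below counts a prediction as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5. That threshold and the one-box-per-frame setup are assumptions of this sketch rather than details stated on this page, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(predictions, ground_truths, threshold=0.5):
    """Percentage of frames whose predicted box reaches the IoU threshold.

    Assumes exactly one predicted and one ground-truth box per frame.
    """
    hits = sum(
        1 for pred, gt in zip(predictions, ground_truths)
        if iou(pred, gt) >= threshold
    )
    return 100.0 * hits / len(predictions)


# Two frames: the first prediction matches exactly, the second misses.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
score = corloc(preds, gts)
print(score)  # 50.0
```

Frames with several annotated instances would need a matching rule (for example, taking the best IoU over all ground-truth boxes), which this sketch leaves out.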