Knowledge-guided Causal Intervention for Weakly-supervised Object Localization

Feifei Shao, Yawei Luo, Fei Gao, Yi Yang, Jun Xiao

TL;DR

KG-CI-CAM tackles two core WSOL challenges—entangled context and the classification-localization dilemma—by integrating a structural causal model with a causal context pool and a multi-source knowledge guidance framework. It introduces CI-CAM, a causal intervention-based CAM architecture with a causal context pool to remove confounding context, and employs a dual-expert knowledge transfer to balance classification and localization during training. Empirical results on CUB-200-2011 and ILSVRC 2016 show consistent improvements over strong baselines and several SOTA methods across multiple metrics, particularly in challenging settings with contextual confounding. The work demonstrates that combining causal reasoning with targeted knowledge transfer can substantially enhance weakly-supervised object localization without sacrificing classification performance.

Abstract

Previous weakly-supervised object localization (WSOL) methods aim to expand the discriminative areas of activation maps to cover whole objects, yet neglect two inherent challenges of relying solely on image-level labels. First, the "entangled context" issue arises from object-context co-occurrence (e.g., fish and water), which makes it hard for the model to distinguish object boundaries clearly. Second, the "C-L dilemma" results from the information decay caused by pooling layers, which struggle to retain both the semantic information needed for precise classification and the fine details needed for accurate localization, leading to a trade-off in performance. In this paper, we propose a knowledge-guided causal intervention method, dubbed KG-CI-CAM, to address these two under-explored issues together. More specifically, we tackle the co-occurrence context confounder problem via causal intervention, which explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps. Based on the disentangled object feature, we introduce a multi-source knowledge guidance framework to strike a balance between absorbing classification knowledge and localization knowledge during model training. Extensive experiments conducted on several benchmark datasets demonstrate the effectiveness of KG-CI-CAM in learning distinct object boundaries amidst confounding contexts and mitigating the dilemma between classification and localization performance.
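
For orientation, causal intervention on a confounder is commonly written with the standard backdoor adjustment. The block below is a sketch of that textbook identity using the symbols from the paper's structural causal model ($X$: feature maps, $C$: confounding context, $Y$: image label). It assumes that $C$ satisfies the backdoor criterion for the effect of $X$ on $Y$ and that contexts are discrete. The paper's own formulation and any approximations it uses may differ.

```latex
% Standard backdoor adjustment (textbook form; assumed, not quoted from the paper).
% Observational likelihood: the context c is drawn conditional on X, so
% object-context co-occurrence biases the prediction.
P(Y \mid X) = \sum_{c} P(Y \mid X, c)\, P(c \mid X)

% Interventional likelihood: cutting C -> X makes c independent of X,
% so each context is weighted by its prior P(c) instead.
P(Y \mid \mathrm{do}(X)) = \sum_{c} P(Y \mid X, c)\, P(c)
```

The only difference between the two lines is the weight on each context: $P(c \mid X)$ lets co-occurring context (water for fish) dominate, while $P(c)$ treats every context according to its overall frequency.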

Paper Structure (24 sections, 16 equations, 7 figures, 7 tables)

This paper contains 24 sections, 16 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: (a) Visualization comparison between vanilla CAM, NL-CCAM, CI-CAM, and KG-CI-CAM. The yellow arrows indicate regions that suffer from entangled contexts. (b) The classification-localization dilemma faced by CI-CAM, where classification and localization show a performance gap and cannot reach their highest accuracy simultaneously.
  • Figure 2: (a) Building the structural causal model (SCM) in WSOL. (b) Cutting off the confounding effect of $C \rightarrow X$ in WSOL. $X$: feature maps. $C$: confounding context. $V$: image representation. $Y$: image label.
  • Figure 3: Overview of the proposed causal network architecture, CI-CAM. CI-CAM consists of four parts: a backbone to extract the feature maps, weight-sharing CAM modules to generate class activation maps, a causal context pool to enhance the feature maps by eliminating the negative effect of the confounder, and a combinational module to generate the final bounding box.
  • Figure 4: (a) Overview of the multi-source knowledge guidance framework. (b) Overview of the classification expert network, in which the two green CI-CAM models share weights. (c) Overview of the localization expert network, featuring three blue CI-CAM models, all of which are identical.
  • Figure 5: Comparison between CI-CAM and localization expert.
  • ...and 2 more figures
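
The captions above refer to vanilla CAM and to turning activation maps into a bounding box. As a rough illustration only, the sketch below shows the generic vanilla CAM computation (a class-weighted sum of backbone feature maps) followed by a simple threshold-based box. It is not CI-CAM or KG-CI-CAM: the function names, the 0.5 threshold, and the toy shapes are assumptions made for this example, and the causal context pool and knowledge guidance are not modeled.

```python
import numpy as np


def class_activation_map(features, fc_weights, class_idx):
    """Vanilla CAM: weight each feature map by the classifier weight of one class.

    features:   array of shape (K, H, W), backbone feature maps (hypothetical input)
    fc_weights: array of shape (num_classes, K), weights of the final linear layer
    Returns an (H, W) map rescaled to [0, 1].
    """
    # Contract over the channel axis K to get a single (H, W) map.
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))
    cam = cam - cam.min()
    peak = cam.max()
    # Avoid division by zero when the map is constant.
    return cam / peak if peak > 0 else cam


def bounding_box(cam, threshold=0.5):
    """Tightest box (x_min, y_min, x_max, y_max) around cells >= threshold.

    The threshold value is an illustrative assumption, not taken from the paper.
    Returns None when no cell passes the threshold.
    """
    ys, xs = np.where(cam >= threshold)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


# Toy example: two 4x4 feature maps, two classes.
features = np.zeros((2, 4, 4))
features[0, 1:3, 1:3] = 1.0          # channel 0 fires on a central 2x2 patch
fc_weights = np.array([[1.0, 0.0],   # class 0 relies on channel 0
                       [0.0, 1.0]])  # class 1 relies on channel 1 (silent here)

cam0 = class_activation_map(features, fc_weights, class_idx=0)
box0 = bounding_box(cam0)
box1 = bounding_box(class_activation_map(features, fc_weights, class_idx=1))
```

In this toy setup `box0` covers the central patch and `box1` is `None` because channel 1 never activates. In practice the map would be upsampled to the input resolution before thresholding.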