Knowledge-guided Causal Intervention for Weakly-supervised Object Localization
Feifei Shao, Yawei Luo, Fei Gao, Yi Yang, Jun Xiao
TL;DR
KG-CI-CAM tackles two core WSOL challenges, entangled context and the classification-localization dilemma, by combining a structural causal model with a multi-source knowledge guidance framework. It introduces CI-CAM, a causal intervention-based CAM architecture that uses a causal context pool to remove confounding context, and employs a dual-expert knowledge transfer to balance classification and localization during training. Empirical results on CUB-200-2011 and ILSVRC 2016 show consistent improvements over strong baselines and several state-of-the-art methods across multiple metrics, particularly in challenging settings with contextual confounding. The work demonstrates that combining causal reasoning with targeted knowledge transfer can substantially enhance weakly-supervised object localization without sacrificing classification performance.
Abstract
Previous weakly-supervised object localization (WSOL) methods aim to expand the discriminative areas of activation maps to cover whole objects, yet neglect two inherent challenges that arise when relying solely on image-level labels. First, the "entangled context" issue arises from object-context co-occurrence (e.g., fish and water), which makes it hard for the model to distinguish object boundaries clearly. Second, the "C-L dilemma" issue results from the information decay caused by the pooling layers, which struggle to retain both the semantic information needed for precise classification and the fine details needed for accurate localization, leading to a trade-off in performance. In this paper, we propose a knowledge-guided causal intervention method, dubbed KG-CI-CAM, to address these two under-explored issues together. More specifically, we tackle the co-occurrence context confounder problem via causal intervention, which explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps. Based on the disentangled object feature, we introduce a multi-source knowledge guidance framework to strike a balance between absorbing classification knowledge and localization knowledge during model training. Extensive experiments conducted on several benchmark datasets demonstrate the effectiveness of KG-CI-CAM in learning distinct object boundaries amidst confounding contexts and mitigating the dilemma between classification and localization performance.
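
As a rough illustration of what intervening on a context confounder means, the standard backdoor adjustment can be written as below. This is a generic sketch, not the formulation from the paper: the abstract does not give the exact equations, and the symbols X (image features), C (context), and Y (category) are assumed here only to mirror the three variables it names.

```latex
% Generic backdoor adjustment; notation is assumed, not taken from the paper.
% Observational prediction: the context distribution depends on the image X,
% so co-occurring context (e.g., water for fish) biases the prediction.
P(Y \mid X) = \sum_{c} P(Y \mid X, C = c)\, P(C = c \mid X)

% Interventional prediction: each context c is weighted by its prior P(C = c),
% which cuts the dependence of the context on X.
P(Y \mid do(X)) = \sum_{c} P(Y \mid X, C = c)\, P(C = c)
```

The only difference between the two expressions is the weight on each context: P(C = c | X) in the first and P(C = c) in the second. Under this reading, a pool that stores per-class context would play the role of the set of contexts summed over, though that correspondence is an assumption rather than something the abstract states.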
