How to Understand "Support"? An Implicit-enhanced Causal Inference Approach for Weakly-supervised Phrase Grounding
Jiamin Luo, Jianing Zhao, Jingjing Wang, Guodong Zhou
TL;DR
Weakly-supervised Phrase Grounding (WPG) often overlooks implicit phrase-region relations that encode deeper multimodal semantics. IECI integrates two causal inference techniques, front-door intervention to mitigate confounding and counterfactual reasoning to highlight implicit relations, within an Encoding Block, Implicit-aware Deconfounded Attention, and Implicit-aware Counterfactual Inference. It uses an NWGM-based (Normalized Weighted Geometric Mean) approximation to estimate $P(L|do(X))$ and derive the implicit and explicit alignments. The authors also annotate a high-quality implicit-enhanced dataset to benchmark this capability, showing that IECI outperforms state-of-the-art baselines and even strong multimodal LLMs on the implicit task, with ablation analyses confirming the contributions of both causal components. This work advances weakly-supervised visual grounding by explicitly modeling and evaluating implicit semantics, offering a practical framework for improving and assessing multimodal understanding in real-world settings.
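For readers unfamiliar with the interventional quantity $P(L|do(X))$, the textbook front-door adjustment can be sketched as follows. This is only the generic form: the mediator $M$, the dummy variable $x'$, and the scoring function $g$ are notation assumed here for illustration and are not taken from the paper.

$$
P(L \mid do(X = x)) = \sum_{m} P(m \mid x) \sum_{x'} P(x')\, P(L \mid m, x')
$$

When the predictor ends in a softmax layer, NWGM-style approximations typically avoid the explicit double sum by moving the expectations inside the softmax. Under that assumption, one possible form is:

$$
P(L \mid do(X = x)) \approx \mathrm{Softmax}\big(g(\mathbb{E}_{m \mid x}[m],\ \mathbb{E}_{x'}[x'])\big)
$$

How IECI instantiates $M$ and $g$ is not specified in this summary, so the second expression should be read as an illustration of the general idea and not as the authors' exact estimator.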
Abstract
Weakly-supervised Phrase Grounding (WPG) is an emerging task that infers fine-grained phrase-region matching while using only coarse-grained sentence-image pairs for training. However, existing studies on WPG largely ignore implicit phrase-region matching relations, which are crucial for evaluating how well models understand deep multimodal semantics. To this end, this paper proposes an Implicit-Enhanced Causal Inference (IECI) approach that addresses two challenges: modeling the implicit relations, and highlighting them beyond the explicit ones. Specifically, the approach uses intervention to tackle the first challenge and counterfactual reasoning to tackle the second. Furthermore, a high-quality implicit-enhanced dataset is annotated to evaluate IECI, and detailed evaluations show clear advantages of IECI over state-of-the-art baselines. In particular, IECI outperforms advanced multimodal LLMs by a large margin on this implicit-enhanced dataset, a finding that may encourage further research on evaluating multimodal LLMs in this direction.
