Counterfactual Generative Zero-Shot Semantic Segmentation
Feihong Shen, Jun Liu, Ping Hu
TL;DR
The paper tackles bias in zero-shot semantic segmentation arising from spurious statistical correlations in generative models. It proposes a counterfactual deconfounding framework that creates two branches, one anchored on real features ($R$) and another on generated features ($F$), and fuses their predictions while mitigating indirect effects via counterfactual reasoning, using the total effect ($TE$), natural direct effect ($NDE$), and natural indirect effect ($NIE$) to isolate true causal effects. A Kalman-inspired fusion based on $var(R)$ and $var(F)$, plus a two-branch loss $\mathcal{L}_{pred}$, enables unbiased learning; the approach is extended with a Graph Convolutional Network (GCN) to propagate information across related classes, improving unseen-class generation. Empirically, the method improves over strong baselines (ZS3Net, SPNet) on Pascal-VOC 2012 and Pascal-Context, with notable gains for unseen classes and complementary gains when combined with the GCN. This framework provides a principled, generally applicable route to reduce bias in generative zero-shot segmentation and can be extended with additional causal or relational components for broader computer vision tasks.
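As a rough illustration of the variance-based fusion mentioned above, the sketch below combines per-class scores from the real-feature branch and the generated-feature branch using standard inverse-variance weighting, written as a scalar Kalman update. The function name `kalman_fuse`, its arguments, and the use of a single scalar variance per branch are assumptions made for illustration; this summary does not state the paper's exact fusion rule or how $var(R)$ and $var(F)$ are estimated.

```python
def kalman_fuse(r_scores, f_scores, var_r, var_f, eps=1e-12):
    """Fuse two per-class score lists with a scalar Kalman-style gain.

    Hypothetical helper: var_r and var_f stand in for var(R) and var(F);
    how the paper estimates them is not specified in this summary.
    """
    if len(r_scores) != len(f_scores):
        raise ValueError("both branches must score the same classes")
    if var_r < 0 or var_f < 0:
        raise ValueError("variances must be non-negative")
    # Gain in [0, 1]: the noisier the R branch, the more weight F receives.
    gain = var_r / (var_r + var_f + eps)
    # Move each R score toward the matching F score by the gain.
    return [r + gain * (f - r) for r, f in zip(r_scores, f_scores)]


# Example with equal variances for the two branches.
fused = kalman_fuse([0.0, 2.0], [1.0, 3.0], var_r=1.0, var_f=1.0)
```

With equal variances the two branches are effectively averaged; as `var_f` shrinks toward zero the fused scores approach the $F$ branch, and as `var_r` shrinks toward zero they stay at the $R$ branch.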
Abstract
Zero-shot learning is an essential part of computer vision. As a classical downstream task, zero-shot semantic segmentation has been studied because of its practical value. One of the popular zero-shot semantic segmentation methods is based on a generative model, and most newly proposed works add structures to the same architecture to enhance this model. However, we find that, from the view of causal inference, the result of the original model is influenced by spurious statistical relationships, so its predictions show severe bias. In this work, we use counterfactual methods to avoid the confounder in the original model. Based on this approach, we propose a new framework for zero-shot semantic segmentation. Our model is compared with baseline models on two real-world datasets, Pascal-VOC and Pascal-Context. The experimental results show that the proposed models surpass previous confounded models and can still make use of additional structures to improve performance. We also design a simple structure based on Graph Convolutional Networks (GCN) in this work.
