Conterfactual Generative Zero-Shot Semantic Segmentation

Feihong Shen, Jun Liu, Ping Hu

TL;DR

The paper tackles bias in zero-shot semantic segmentation that arises from spurious statistical correlations in generative models. It proposes a counterfactual deconfounding framework with two branches, one anchored on real features ($R$) and one on generated features ($F$), and fuses their predictions while mitigating indirect effects via counterfactual reasoning, using the total effect ($TE$), natural direct effect ($NDE$), and natural indirect effect ($NIE$) to isolate the true causal effect. A Kalman-inspired fusion based on $var(R)$ and $var(F)$, together with a two-branch loss $\mathcal{L}_{pred}$, enables unbiased learning; the approach is extended with a Graph Convolutional Network (GCN) that propagates information across related classes to improve unseen-class feature generation. Empirically, the method improves over strong baselines (ZS3Net, SPNet) on Pascal-VOC 2012 and Pascal-Context, with notable gains on unseen classes and complementary gains when combined with the GCN. The framework offers a principled, generally applicable route to reducing bias in generative zero-shot segmentation and can be extended with additional causal or relational components for broader CV tasks.
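The Kalman-inspired fusion mentioned above can be illustrated with a minimal sketch. The summary only states that the fusion depends on $var(R)$ and $var(F)$, so the exact formula below (a standard inverse-variance update), the function name `kalman_fuse`, and the example scores are assumptions for illustration, not the authors' implementation.

```python
def kalman_fuse(r_scores, f_scores, var_r, var_f):
    """Fuse two per-class score vectors with a Kalman-style gain.

    Assumption: var_r and var_f are non-negative scalar uncertainty
    estimates for the R and F branches; how the paper computes them is
    not stated in this summary.
    """
    if len(r_scores) != len(f_scores):
        raise ValueError("score vectors must have the same length")
    if var_r < 0 or var_f < 0:
        raise ValueError("variances must be non-negative")
    total = var_r + var_f
    # Gain near 0 trusts the R branch, gain near 1 trusts the F branch.
    gain = 0.5 if total == 0 else var_r / total
    return [r + gain * (f - r) for r, f in zip(r_scores, f_scores)]


# Hypothetical per-class scores for one pixel from the two branches.
real_branch = [0.8, 0.2]
fake_branch = [0.4, 0.6]
fused = kalman_fuse(real_branch, fake_branch, var_r=1.0, var_f=3.0)
```

Because `var_r` is smaller than `var_f` in this example, the fused scores stay closer to the real-feature branch; swapping the two variances would pull them toward the generated-feature branch instead.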

Abstract

Zero-shot learning is an essential part of computer vision. As a classical downstream task, zero-shot semantic segmentation has been studied because of its application value. One popular family of zero-shot semantic segmentation methods is based on a generative model, and most newly proposed works add structures to the same architecture to enhance it. However, we find that, from the viewpoint of causal inference, the output of the original model is influenced by spurious statistical relationships, so its predictions show severe bias. In this work, we use counterfactual methods to avoid the confounder in the original model. Based on this approach, we propose a new framework for zero-shot semantic segmentation. Our model is compared with baseline models on two real-world datasets, Pascal-VOC and Pascal-Context. The experimental results show that the proposed models surpass previous confounded models and can still make use of additional structures to improve performance. We also design a simple structure based on Graph Convolutional Networks (GCN) in this work.

Paper Structure

This paper contains 14 sections, 15 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: The first column shows the original images, which contain the unseen classes motorbike and cat. The second column shows the pixel-level ground truth (GT) of the semantic segmentation. The traditional generative zero-shot semantic segmentation model ZS3Net incorrectly classifies these objects as the similar classes bicycle and dog, both of which are visible in the training set. Our model alleviates this dataset bias.
  • Figure 2: Illustration of our counterfactual generative zero-shot semantic segmentation framework. In the training phase, once an image is input, the backbone extracts the visible features. We use these features to train classifier A and the generators. If the annotation of the image contains unseen classes, we use fake features from the generators to train classifier B. We apply the deconfounding process to the output of classifier B. The outputs of the two classifiers are combined by our fusion strategy. In the testing phase, we feed the features directly into classifiers A and B. The output of the fusion function is our final prediction.
  • Figure 3: (a) The causal structure of traditional generative zero-shot semantic segmentation. (b) The ideal deconfounder model for zero-shot semantic segmentation. (c) The causal structure of SPNet.
  • Figure 4: Our cause-effect look at Eq. \ref{eq:out}. One of our model's training branches is based on this equation.
  • Figure 5: Suppose we have three classes of objects in our dataset: cat, cow, and bike. The images of the bike are unseen to our structure. After getting the seen features from the backbone, we place them in their rows of the feature matrix and keep the row that represents the bike feature empty. After the graph convolution, we obtain the fake features of the bike class. In our structure, we regard the graph convolutional network as a tool for message passing over an irregular data structure: the unseen class receives its features from the seen classes through the relationship graph.
  • ...and 3 more figures