Robust Visual Representation Learning with Multi-modal Prior Knowledge for Image Classification Under Distribution Shift

Hongkuan Zhou, Lavdim Halilaj, Sebastian Monka, Stefan Schmid, Yuqicheng Zhu, Bo Xiong, Steffen Staab

TL;DR

This work tackles distribution-shift robustness by proposing Knowledge-Guided Visual representation learning (KGV), a neuro-symbolic framework that jointly leverages a domain knowledge graph and synthetic visuals aligned in a common latent space. A novel translation-based KG embedding is used, with KG nodes modeled as Gaussian distributions to capture hierarchical and instance-level variability, and embeddings are regularized through a combined loss with standard supervised objectives. Across Road Sign, Car, and ImageNet domains, KGV yields consistent improvements in accuracy and data efficiency under distribution shifts and few-shot settings, and it remains compatible with large pre-trained vision models. The approach highlights the value of multi-modal priors for robust representation learning and offers a scalable path to integrating textual and symbolic knowledge with visual data.

Abstract

Despite the remarkable success of deep neural networks (DNNs) in computer vision, their performance degrades under distribution shifts between training and testing data. In this paper, we propose Knowledge-Guided Visual representation learning (KGV), a distribution-based learning approach that leverages multi-modal prior knowledge to improve generalization under distribution shift. It integrates knowledge from two distinct modalities: 1) a knowledge graph (KG) with hierarchical and association relationships; and 2) generated synthetic images of visual elements semantically represented in the KG. Embeddings from both modalities are generated in a common latent space, i.e., visual embeddings from original and synthetic images as well as knowledge graph embeddings (KGEs). These embeddings are aligned via a novel variant of translation-based KGE methods, where the node and relation embeddings of the KG are modeled as Gaussian distributions and translations, respectively. We claim that incorporating multi-modal prior knowledge enables more regularized learning of image representations, so that the models generalize better across different data distributions. We evaluate KGV on image classification tasks with major or minor distribution shifts, namely road sign classification across datasets from Germany, China, and Russia, image classification with the mini-ImageNet dataset and its variants, as well as the DVM-CAR dataset. The results demonstrate that KGV consistently exhibits higher accuracy and data efficiency across all experiments.
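To make the alignment idea in the abstract more concrete, the following is a minimal sketch of one way a translation-based score with Gaussian node embeddings could look. It is not the authors' formulation: the function name `gaussian_translation_nll`, the diagonal covariance, the use of a negative log-likelihood as the score, and all numbers are assumptions chosen for illustration. The image embedding is shifted by a relation translation and then scored under the Gaussian of a candidate KG node, so a lower value means a better fit.

```python
import math


def gaussian_translation_nll(z_img, translation, mu, sigma):
    """Hypothetical alignment score (an assumption, not the paper's loss).

    Translates the image embedding by the relation vector and returns the
    negative log-likelihood of the result under a diagonal Gaussian
    N(mu, diag(sigma^2)) that stands in for a KG node embedding.
    """
    nll = 0.0
    for z, r, m, s in zip(z_img, translation, mu, sigma):
        shifted = z + r  # translate the image embedding along the relation
        nll += 0.5 * math.log(2.0 * math.pi * s * s)
        nll += (shifted - m) ** 2 / (2.0 * s * s)
    return nll


# Toy 2-d example with made-up numbers.
z_img = [0.2, -0.1]                          # image embedding
r_has_shape = [0.8, 0.1]                     # relation modeled as a translation
node_circle = ([1.0, 0.0], [0.5, 0.5])       # (mean, std) of a matching node
node_triangle = ([-1.0, 1.0], [0.5, 0.5])    # (mean, std) of a non-matching node

nll_circle = gaussian_translation_nll(z_img, r_has_shape, *node_circle)
nll_triangle = gaussian_translation_nll(z_img, r_has_shape, *node_triangle)
print(nll_circle < nll_triangle)  # the matching node should fit better
```

In this sketch the node's standard deviation controls how much variation among image instances is tolerated before the score penalizes them, which is one plausible reading of why nodes are modeled as distributions rather than points.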

Paper Structure

This paper contains 41 sections, 4 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: (a) Examples of distribution shift in the road sign, ImageNet, and car domains. (b) Decomposition of road signs and cars into their elements, representing the association relations between object categories and object category elements. (c) Examples of the abstract representation of hierarchical relations in the ImageNet domain.
  • Figure 2: The KGV architecture. Our approach consists of three phases, namely knowledge modeling, training, and inference. Knowledge modeling phase: We create a knowledge graph based on prior domain knowledge. Also, synthetic images are generated for object categories (e.g., shapes and colors) that are semantically represented in the knowledge graph but lack visual information in the dataset. Training phase: The neural network is fed with both synthetic and real-world images and trained end-to-end, with the regularization loss and the cross-entropy loss added together as the total loss for optimization. The image embeddings $\boldsymbol{z^I}$ and knowledge graph embeddings $\boldsymbol{z^r_i}$, $\boldsymbol{z^o_j}$ are aligned by minimizing the regularization loss. The cross-entropy loss is used to classify images based on their image embedding representation. Inference phase: The classification task is completed by selecting the class with the highest probability based on the output of the decoder.
  • Figure 3: Performance under Low Data Regime. KGV$^-$ is the KGV variant trained without synthetic images.
  • Figure 4: An abstract view of the hierarchical relations existing in the road sign recognition domain.
  • Figure 5: The relations between one image instance of Porsche 911 and other nodes in the car recognition KG.
  • ...and 1 more figures
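
The training and inference phases described in the Figure 2 caption can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names (`softmax`, `cross_entropy`, `total_loss`, `predict`), the logits, and the placeholder value for the regularization loss are all hypothetical, and the caption does not say whether the two losses are weighted, so the sketch uses a plain sum.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of the true class under the softmax of the logits."""
    return -math.log(softmax(logits)[label])


def total_loss(logits, label, reg_loss):
    # Per the Figure 2 caption, the regularization loss and the
    # cross-entropy loss are added together (unweighted sum assumed here).
    return cross_entropy(logits, label) + reg_loss


def predict(logits):
    """Inference: pick the class with the highest probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


# Made-up decoder output for three classes and a placeholder
# regularization value standing in for the embedding-alignment loss.
logits = [2.0, 0.5, -1.0]
loss = total_loss(logits, label=0, reg_loss=0.3)
pred = predict(logits)
print(pred, round(loss, 4))
```

At inference time only `predict` is needed in this sketch; the regularization term shapes the embeddings during training and plays no role once the network is trained.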