Visual-Semantic Decomposition and Partial Alignment for Document-based Zero-Shot Learning

Xiangyan Qu; Jing Yu; Keke Gai; Jiamin Zhuang; Yuanmin Tang; Gang Xiong; Gaopeng Gou; Qi Wu

TL;DR

This paper tackles document-based zero-shot learning by addressing the misalignment between document semantics and visual content. It introduces EmDepart, a framework that decomposes both visual and textual information into multi-view semantic embeddings via a Semantic Decomposition Module, and uses loss terms to prevent feature collapse and promote diversity among views. Partial semantic alignment is achieved through view-level and word-to-patch level interactions, plus a partial scoring mechanism that filters unmatched information during inference. The approach achieves state-of-the-art results on three benchmarks (AWA2, CUB, FLO) using Wiki and LLM-enriched Wiki documents, while also providing qualitative demonstrations of interpretable partial associations. This work advances knowledge transfer in document-based ZSL and offers a scalable, interpretable mechanism for aligning heterogeneous semantic signals across modalities.
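The idea of a partial score can be illustrated with a small sketch. This is not the scoring function of EmDepart, whose exact form is not given here. It is a minimal pure-Python illustration that assumes each image and each document has already been decomposed into a few view embeddings, that every image view is matched to its most similar document view by cosine similarity, and that only the `top_k` best matches contribute to the score. The names `cosine`, `partial_score`, and `top_k` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def partial_score(image_views, doc_views, top_k=2):
    """Average the top_k best view-level matches and ignore the rest.

    Assumption for illustration only: unmatched views are filtered by
    keeping the top_k highest best-match similarities.
    """
    if not image_views or not doc_views:
        raise ValueError("both sides need at least one view embedding")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    # Best-matching document view for every image view.
    best = [max(cosine(iv, dv) for dv in doc_views) for iv in image_views]
    # Keep only the strongest matches, so views without a counterpart are dropped.
    kept = sorted(best, reverse=True)[:top_k]
    return sum(kept) / len(kept)


# Toy data: the third image view has no counterpart in the document.
image_views = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
doc_views = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(partial_score(image_views, doc_views, top_k=2))  # partial: unmatched view ignored
print(partial_score(image_views, doc_views, top_k=3))  # full: unmatched view lowers the score
```

With these toy embeddings, the partial score is unaffected by the image view that the document does not describe, whereas averaging over all views is pulled down by it.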

Abstract

Recent work shows that documents from encyclopedias serve as helpful auxiliary information for zero-shot learning. Existing methods align the entire semantics of a document with the corresponding images to transfer knowledge. However, they disregard that the semantic information in a document and in an image is not equivalent, which results in a suboptimal alignment. In this work, we propose a novel network that extracts multi-view semantic concepts from documents and images and aligns only the matching concepts rather than all of them. Specifically, we propose a semantic decomposition module that generates multi-view semantic embeddings from the visual and textual sides, providing the basic concepts for partial alignment. To alleviate information redundancy among embeddings, we propose a local-to-semantic variance loss to capture distinct local details and a multiple semantic diversity loss to enforce orthogonality among embeddings. Subsequently, two losses are introduced to partially align visual-semantic embedding pairs according to their semantic relevance at the view and word-to-patch levels. As a result, we consistently outperform state-of-the-art methods with two document sources on three standard benchmarks for document-based zero-shot learning. Qualitatively, we show that our model learns interpretable partial associations.
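To make the notion of enforcing orthogonality among embeddings concrete, the following is a minimal sketch of one common way to penalize redundancy between view embeddings: the mean squared cosine similarity over all distinct pairs. This is an illustrative stand-in and not the paper's multiple semantic diversity loss, whose formula is not given here. The function names `l2_normalize` and `diversity_penalty` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def diversity_penalty(views):
    """Mean squared cosine similarity over all distinct pairs of views.

    Assumed form for illustration: 0.0 when every pair is orthogonal,
    close to 1.0 when all views point in the same direction.
    """
    if len(views) < 2:
        return 0.0
    unit = [l2_normalize(v) for v in views]
    total = 0.0
    pairs = 0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            total += cos * cos
            pairs += 1
    return total / pairs


# Three mutually orthogonal views versus three views that collapsed onto one direction.
orthogonal_views = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
collapsed_views = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [3.0, 3.0, 0.0]]

print(diversity_penalty(orthogonal_views))  # no redundancy between views
print(diversity_penalty(collapsed_views))   # maximal redundancy between views
```

Minimizing a penalty of this kind pushes the view embeddings of one image or document apart, which is the behavior the abstract attributes to the diversity loss.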

Paper Structure (36 sections, 14 equations, 7 figures, 14 tables)

Introduction
Related work
Method
Document Collection
Feature Extractor
Semantic Decomposition Module
Distinct Semantic Information Learning
Partial Semantic Alignment
Inference
Experiments
Comparing with the SOTA Methods
Analysis of Feature Collapse
Analysis of Partial Association
Ablation Study
Impact of Hyperparameters
...and 21 more sections

Figures (7)

Figure 1: Partial associations between documents and images. The semantic content in the category document may partially be reflected in the image. Distinct images capture varying aspects of the semantic information within the document.
Figure 2: Illustration of different methods. (a) Existing methods align the entire semantics of documents with images. (b) Our model decomposes semantic concepts and models the partial association to align the matching concepts accurately.
Figure 3: An overview of our model. (a) EmDepart contains an image perceiver, a text perceiver, and visual and textual semantic decomposition modules. (b) Our loss functions. The first loss encourages each view embedding to focus on distinct local details. The second loss encourages each embedding to be orthogonal to the others. The last two losses partially align semantics at the view and word-to-patch levels.
Figure 4: Analysis of feature collapse. Each number denotes a class (same color), and each shape denotes one of the view embeddings. With the addition of $\mathcal{L}_{var}$ and $\mathcal{L}_{div}$, the information differences between embeddings gradually increase.
Figure 5: Analysis of the baseline and of the model after adding our losses. A larger $S_{var}$ indicates more distinct embeddings, and $S_{var} = 0$ indicates that the embeddings are all identical.
...and 2 more figures
