Visual-Semantic Decomposition and Partial Alignment for Document-based Zero-Shot Learning
Xiangyan Qu, Jing Yu, Keke Gai, Jiamin Zhuang, Yuanmin Tang, Gang Xiong, Gaopeng Gou, Qi Wu
TL;DR
This paper tackles document-based zero-shot learning by addressing the misalignment between document semantics and visual content. It introduces EmDepart, a framework that decomposes both visual and textual information into multi-view semantic embeddings via a Semantic Decomposition Module, and uses two loss terms (a local-to-semantic variance loss and a multiple semantic diversity loss) to prevent feature collapse and promote diversity among views. Partial semantic alignment is achieved through view-level and word-to-patch-level interactions, plus a partial scoring mechanism that filters unmatched information during inference. The approach achieves state-of-the-art results on three benchmarks (AWA2, CUB, FLO) using Wiki and LLM-enriched Wiki documents, and provides qualitative demonstrations of interpretable partial associations. This work advances knowledge transfer in document-based ZSL and offers a scalable, interpretable mechanism for aligning heterogeneous semantic signals across modalities.
Abstract
Recent work shows that documents from encyclopedias serve as helpful auxiliary information for zero-shot learning. Existing methods align the entire semantics of a document with the corresponding images to transfer knowledge. However, they disregard the fact that a document and its images do not carry equivalent semantic information, which results in suboptimal alignment. In this work, we propose a novel network that extracts multi-view semantic concepts from documents and images and aligns only the matching concepts rather than all of them. Specifically, we propose a semantic decomposition module to generate multi-view semantic embeddings from the visual and textual sides, providing the basic concepts for partial alignment. To alleviate information redundancy among embeddings, we propose a local-to-semantic variance loss to capture distinct local details and a multiple semantic diversity loss to enforce orthogonality among embeddings. Subsequently, two losses are introduced to partially align visual-semantic embedding pairs according to their semantic relevance at the view and word-to-patch levels. As a result, our method consistently outperforms state-of-the-art methods with two document sources on three standard benchmarks for document-based zero-shot learning. Qualitatively, we show that our model learns interpretable partial associations.
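
The abstract states that the multiple semantic diversity loss enforces orthogonality among embeddings but does not give its exact form. As an illustration only, one common way to encourage orthogonality among K view embeddings is to penalize the squared cosine similarity between every pair of distinct views. The sketch below follows that generic recipe; the function name `orthogonality_penalty`, the squared-cosine choice, and the averaging are assumptions made for this example, not the paper's definition.

```python
import math
from typing import Sequence


def orthogonality_penalty(views: Sequence[Sequence[float]]) -> float:
    """Mean squared cosine similarity over all ordered pairs of distinct views.

    Hypothetical stand-in for a diversity loss: it is 0.0 when the views are
    mutually orthogonal and 1.0 when all views point in the same direction.
    """
    k = len(views)
    if k < 2:
        raise ValueError("need at least two view embeddings")

    # L2-normalize each view so that dot products become cosine similarities.
    unit = []
    for v in views:
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            raise ValueError("zero-norm embedding")
        unit.append([x / norm for x in v])

    # Accumulate squared cosine similarity over off-diagonal pairs only.
    total = 0.0
    for i in range(k):
        for j in range(k):
            if i != j:
                cos = sum(a * b for a, b in zip(unit[i], unit[j]))
                total += cos * cos
    return total / (k * (k - 1))


if __name__ == "__main__":
    orthogonal_views = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    redundant_views = [[1.0, 1.0, 1.0, 1.0]] * 3
    print(orthogonality_penalty(orthogonal_views))
    print(orthogonality_penalty(redundant_views))
```

Minimizing a penalty of this kind pushes the view embeddings apart in direction, which is the stated purpose of the diversity term; in a training setup it would be written with a differentiable tensor library rather than plain Python lists.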
