MADS: Multi-Attribute Document Supervision for Zero-Shot Image Classification

Xiangyan Qu, Jing Yu, Jiamin Zhuang, Gaopeng Gou, Gang Xiong, Qi Wu

TL;DR

This work tackles zero-shot image classification by leveraging encyclopedic documents as auxiliary knowledge and addressing three key issues: noisy non-visual content, under-described fine-grained categories, and suboptimal word-region alignment. It introduces Multi-Attribute Document Supervision (MADS), a framework that uses LLM-guided noise removal to produce multi-attribute documents and a dedicated network that learns independent per-view semantics, aggregates cross-view information, and explicitly focuses attention on visual words. The model employs global and local alignment losses to connect semantic embeddings with image features, enabling robust cross-modal transfer and interpretable, view-level predictions. Experiments on AWA2, CUB, and FLO demonstrate state-of-the-art gains in ZSL and GZSL with comparable compute, and ablations validate the contributions of noise suppression, multi-view modeling, and the focus mechanism. The approach offers practical benefits by reducing reliance on manual attribute annotations and providing interpretable, multi-view explanations for predictions.

Abstract

Zero-shot learning (ZSL) aims to train a model on seen classes and recognize unseen classes by transferring knowledge through shared auxiliary information. Recent studies show that encyclopedia documents provide helpful auxiliary information. However, existing methods align noisy documents, in which visual and non-visual descriptions are entangled, with image regions, and rely solely on implicit learning. These models fail to filter non-visual noise reliably and incorrectly align non-visual words with image regions, which harms knowledge transfer. In this work, we propose a novel multi-attribute document supervision framework that removes noise at both the document collection and model learning stages. With the help of large language models, we introduce a novel prompt algorithm that automatically removes non-visual descriptions and enriches less-described documents across multiple attribute views. Our proposed model, MADS, extracts multi-view transferable knowledge through information decoupling and semantic interactions for semantic alignment at the local and global levels. In addition, we introduce a model-agnostic focus loss that explicitly enhances attention to visually discriminative information during training and also improves existing methods without additional parameters. With comparable computation costs, MADS consistently outperforms the SOTA by 7.2% and 8.2% on average across three benchmarks in the document-based ZSL and GZSL settings, respectively. Moreover, we qualitatively offer interpretable predictions from multiple attribute views.
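
To make the idea of classifying unseen classes from multi-view document semantics concrete, here is a minimal sketch, not the MADS network itself. It assumes each class already has one embedding per attribute view and that images are embedded in the same space. The toy vectors, the view names, the function names, and the plain averaging of per-view cosine similarities are all illustrative assumptions; the paper describes a learned aggregation with global and local alignment, which this sketch does not reproduce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_zero_shot(image_embedding, class_views):
    """Score each class by averaging per-view similarities (an assumed,
    simplified stand-in for a learned multi-view aggregation)."""
    per_view = {}
    scores = {}
    for name, views in class_views.items():
        per_view[name] = {
            view: cosine(image_embedding, emb) for view, emb in views.items()
        }
        scores[name] = sum(per_view[name].values()) / len(per_view[name])
    best = max(scores, key=scores.get)
    return best, scores, per_view


# Hypothetical embeddings for two unseen classes, one vector per attribute view.
unseen_class_views = {
    "zebra": {"color": [0.9, 0.1, 0.8], "shape": [0.7, 0.3, 0.6]},
    "dolphin": {"color": [0.1, 0.9, 0.2], "shape": [0.2, 0.8, 0.3]},
}
image_embedding = [0.8, 0.2, 0.7]  # hypothetical image feature

predicted, scores, per_view = predict_zero_shot(image_embedding, unseen_class_views)
print(predicted)
print(per_view[predicted])  # per-view scores act as a simple explanation
```

The per-view scores returned alongside the prediction illustrate, in a very reduced form, how view-level similarities could be inspected to explain a decision.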

Paper Structure

This paper contains 30 sections, 14 equations, 12 figures, 17 tables, 1 algorithm.

Figures (12)

  • Figure 1: (a) In document-based ZSL, visual words bridge knowledge transfer from seen to unseen classes (shown in the same color). (b) Previous methods align noisy documents with image regions, which is detrimental to knowledge transfer. (c) In contrast, we automatically remove non-visual noise before model training to obtain multi-attribute documents and then explicitly align visual words of each attribute view with salient image regions.
  • Figure 2: An overview of our framework. (a) Document Collection. We instruct LLMs to divide the definition document into paragraphs based on attribute views and to enrich the less-described attribute documents. (b) Our MADS Network. MADS first extracts the core semantics of each attribute paragraph independently and then aggregates multi-view semantics to enhance interactions, aligning them with global and local image embeddings. Unlike previous methods that depend solely on implicit attention mechanisms, we introduce a focus loss to explicitly filter noisy information and attend to visual words.
  • Figure 3: Visualization of salient image regions and the most attended words in the attention mechanism. Although both the baseline and the model with the focus loss extract the salient visual regions, the latter pays more attention to visual words (shown in the white area) that are helpful for ZSL tasks.
  • Figure 4: Illustration of two settings. Default setting: We use the class name and corresponding descriptions to classify. Classification without class name: We replace the name with the dataset domain to remove priors on class names.
  • Figure 5: Effect of loss weights (a-b) and hyperparameter analysis (c-e). The shaded area in (a-b) denotes the performance improvements compared with loss weights set as 0, and in (c-e) denotes the error bars of models trained with three different documents.
  • ...and 7 more figures