CICA: Content-Injected Contrastive Alignment for Zero-Shot Document Image Classification

Sankalp Sinha, Muhammad Saif Ullah Khan, Talha Uddin Sheikh, Didier Stricker, Muhammad Zeshan Afzal

TL;DR

This work tackles zero-shot learning for document image classification, an area with limited standardized evaluation. It introduces CICA, a CLIP-based framework augmented with a content module that processes OCR-derived text and a coupled-contrastive loss to align content with image and text features. On RVL-CDIP, CICA achieves an average ZSL top-1 improvement of 6.7 percentage points and a GZSL harmonic mean improvement of 24 percentage points, while adding only about 3.3% more parameters. The approach, along with newly proposed ZSL/GZSL splits and comprehensive ablations, establishes a practical path toward robust zero-shot document classification and invites further exploration of multimodal content integrations.

Abstract

Zero-shot learning has been extensively investigated in the broader field of visual recognition, attracting significant interest recently. However, the current work on zero-shot learning in document image classification remains scarce. The existing studies either focus exclusively on zero-shot inference, or their evaluation does not align with the established criteria of zero-shot evaluation in the visual recognition domain. We provide a comprehensive document image classification analysis in Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL) settings to address this gap. Our methodology and evaluation align with the established practices of this domain. Additionally, we propose zero-shot splits for the RVL-CDIP dataset. Furthermore, we introduce CICA (pronounced 'ki-ka'), a framework that enhances the zero-shot learning capabilities of CLIP. CICA consists of a novel 'content module' designed to leverage any generic document-related textual information. The discriminative features extracted by this module are aligned with CLIP's text and image features using a novel 'coupled-contrastive' loss. Our module improves CLIP's ZSL top-1 accuracy by 6.7% and GZSL harmonic mean by 24% on the RVL-CDIP dataset. Our module is lightweight and adds only 3.3% more parameters to CLIP. Our work sets the direction for future research in zero-shot document classification.
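The 'coupled-contrastive' alignment described above can be illustrated with a minimal sketch. It assumes a CLIP-style symmetric InfoNCE objective applied twice, once to image-content pairs and once to text-content pairs, with the two terms summed. That form, the temperature value, and the function names `info_nce` and `coupled_contrastive_loss` are assumptions made for illustration; the paper's own equations may weight or define the terms differently.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Symmetric InfoNCE over a batch where queries[i] matches keys[i].

    The temperature of 0.07 is an illustrative default, not a value from the paper.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    # N x N matrix of scaled cosine similarities; the diagonal holds the positive pairs.
    logits = [[sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k] for qi in q]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)  # subtract the max for a numerically stable log-sum-exp
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(columns))


def coupled_contrastive_loss(image_feats, text_feats, content_feats, temperature=0.07):
    """Assumed form: image-content term plus text-content term, equally weighted."""
    return (info_nce(image_feats, content_feats, temperature)
            + info_nce(text_feats, content_feats, temperature))


if __name__ == "__main__":
    # Toy 2-sample batch with made-up 2-D features.
    image = [[1.0, 0.0], [0.0, 1.0]]
    text = [[0.9, 0.1], [0.1, 0.9]]
    content = [[1.0, 0.1], [0.1, 1.0]]
    print(coupled_contrastive_loss(image, text, content))
```

When the content features line up with their matching image and text features, the loss is close to zero; shuffling the content rows within the batch makes it grow, which is the behaviour a contrastive alignment objective is meant to have.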

Paper Structure (24 sections, 12 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Illustration of CICA's architecture and training paradigm, highlighting how image and text features from CLIP are integrated with the content module's features through a coupled contrastive loss. The loss pulls together positive image-content and text-content pairs (highlighted in the matrix) and pushes apart negative pairs (lightly shaded in the matrix). $N$ denotes the batch size.
  • Figure 2: Illustration showing the inference logic for CICA. At test time, the learned content encoder, along with the CLIP text encoder, synthesizes a zero-shot linear classifier by embedding a prompt with the names of the test set's classes along with the content for the test sample. Here, $N$ is the number of classes.
  • Figure 3: Class-wise top-1 accuracies of the frozen CLIP model on the RVL-CDIP dataset, in rank order. The top axis marks the zones corresponding to the varying number of unseen classes in the incremental splits ($S_i^I$).
  • Figure 4: Comparative analysis of CLIP and CICA on the RVL-CDIP incremental splits. Solid lines depict ZSL performance as top-1 accuracy, while dashed lines indicate GZSL outcomes as harmonic mean. CICA consistently outperforms CLIP.
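
The GZSL harmonic mean plotted in Figure 4 can be computed as in the short sketch below. It assumes the standard GZSL definition, in which the harmonic mean combines top-1 accuracy on seen classes with top-1 accuracy on unseen classes. The function name `gzsl_harmonic_mean` and the accuracy values in the example are hypothetical and are not results from the paper.

```python
def gzsl_harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean H = 2 * S * U / (S + U) of seen and unseen class accuracies."""
    if seen_acc + unseen_acc == 0:
        return 0.0  # both accuracies are zero, so the harmonic mean is defined as zero here
    return 2.0 * seen_acc * unseen_acc / (seen_acc + unseen_acc)


if __name__ == "__main__":
    # Made-up accuracies: strong on seen classes, weak on unseen ones.
    print(gzsl_harmonic_mean(0.80, 0.20))  # roughly 0.32
    # The same arithmetic mean (0.50), but balanced across both groups.
    print(gzsl_harmonic_mean(0.50, 0.50))  # roughly 0.50
```

The harmonic mean rewards balance: a model that does well on seen classes but poorly on unseen ones scores much lower than its arithmetic mean would suggest, which is why it is the usual summary metric for the generalized setting.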