CRMSP: A Semi-supervised Approach for Key Information Extraction with Class-Rebalancing and Merged Semantic Pseudo-Labeling

Qi Zhang, Yonghong Song, Pengcheng Guo, Yangyang Hui

TL;DR

This work tackles Key Information Extraction under limited labeling by introducing CRMSP, a semi-supervised framework that specifically addresses long-tailed distributions and tail-class separation. It integrates Class-Rebalancing Pseudo-Labeling (CRP) to upweight tail pseudo-labels and Merged Semantic Pseudo-Labeling (MSP) to cluster tail features into merged prototypes, guided by an EMA teacher and a memory bank. The method optimizes a three-term loss that combines supervised, unsupervised, and contrastive objectives, and shows state-of-the-art results on FUNSD, CORD, and long-tailed CV benchmarks, including a $+3.24\%$ F1 gain on CORD over prior SOTA. The findings demonstrate effective utilization of large unlabeled corpora to reduce annotation costs while delivering strong tail-class performance, with potential applicability beyond KIE to other multimodal, imbalanced tasks.

Abstract

Demand is growing in Key Information Extraction (KIE) for semi-supervised learning (SSL) that saves labor and cost, because training on document data with fully supervised methods requires labor-intensive manual annotation. The main challenges of applying SSL to KIE are (1) underestimated confidence for tail classes in the long-tailed distribution and (2) difficulty achieving intra-class compactness and inter-class separability for tail features. To address these challenges, we propose a novel semi-supervised approach for KIE with Class-Rebalancing and Merged Semantic Pseudo-Labeling (CRMSP). First, the Class-Rebalancing Pseudo-Labeling (CRP) module introduces a reweighting factor to rebalance pseudo-labels, increasing attention to tail classes. Second, the Merged Semantic Pseudo-Labeling (MSP) module clusters tail features of unlabeled data by assigning samples to Merged Prototypes (MP). Additionally, we design a new contrastive loss specifically for MSP. Extensive experimental results on three well-known benchmarks demonstrate that CRMSP achieves state-of-the-art performance. Notably, CRMSP achieves a 3.24% F1-score improvement over the state of the art on CORD.
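
The abstract does not state the form of the CRP reweighting factor, so the following is only a minimal sketch of the general idea under stated assumptions: pseudo-labels are taken from confident teacher predictions (FixMatch-style thresholding), and each retained sample's loss is scaled by an inverse-frequency weight so that rarely predicted (tail) classes count more. The function names, the add-one smoothing, the `power` exponent, and the `threshold` value are all hypothetical choices for illustration, not the paper's definitions.

```python
import math
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_rebalancing_weights(pseudo_labels, num_classes, power=1.0):
    """Assumed reweighting factor: inverse pseudo-label frequency.

    The most frequent class gets weight 1; rarer classes get larger weights.
    Add-one smoothing keeps the weight finite for classes with no pseudo-labels.
    """
    counts = Counter(pseudo_labels)
    freq = [counts.get(c, 0) + 1 for c in range(num_classes)]
    max_freq = max(freq)
    return [(max_freq / f) ** power for f in freq]


def rebalanced_unsupervised_loss(teacher_logits, student_logits,
                                 threshold=0.95, power=1.0):
    """Weighted cross-entropy on confident pseudo-labels (illustrative only)."""
    num_classes = len(teacher_logits[0])
    teacher_probs = [softmax(row) for row in teacher_logits]
    pseudo = [max(range(num_classes), key=p.__getitem__) for p in teacher_probs]
    confidence = [max(p) for p in teacher_probs]
    weights = class_rebalancing_weights(pseudo, num_classes, power)

    total = 0.0
    for label, conf, s_logits in zip(pseudo, confidence, student_logits):
        if conf < threshold:
            continue  # low-confidence pseudo-labels are ignored
        cross_entropy = -math.log(softmax(s_logits)[label])
        total += weights[label] * cross_entropy
    # Average over the whole batch, including masked-out samples.
    return total / len(teacher_logits)
```

In this sketch, a batch whose pseudo-labels are `[0, 0, 0, 1]` over three classes yields weights that grow as a class becomes rarer, which is the qualitative behavior the abstract attributes to CRP ("increasing attention to tail classes").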

Paper Structure

This paper contains 33 sections, 14 equations, 10 figures, 5 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Analysis of the class distribution and the dilemma of tail classes.
  • Figure 2: Comparison of precision, recall, and F1-score of pseudo-labels generated by FixMatch and CRMSP. "PL" stands for pseudo-labels.
  • Figure 3: Framework of the proposed Class-Rebalancing and Merged Semantic Pseudo-Labeling (CRMSP). Labeled and unlabeled samples are drawn from the training mini-batch.
  • Figure 4: Confusion matrices of the predictions on the FUNSD test set.
  • Figure 5: Comparison of t-SNE visualizations of unlabeled data on FUNSD.
  • ...and 5 more figures