CRMSP: A Semi-supervised Approach for Key Information Extraction with Class-Rebalancing and Merged Semantic Pseudo-Labeling
Qi Zhang, Yonghong Song, Pengcheng Guo, Yangyang Hui
TL;DR
This work tackles Key Information Extraction (KIE) with limited labeled data by introducing CRMSP, a semi-supervised framework that addresses long-tailed class distributions and poor separation of tail classes. It integrates Class-Rebalancing Pseudo-Labeling (CRP), which upweights tail pseudo-labels, and Merged Semantic Pseudo-Labeling (MSP), which clusters tail features around merged prototypes, guided by an EMA teacher and a memory bank. The method optimizes a three-term loss that combines supervised, unsupervised, and contrastive objectives, and reports state-of-the-art results on FUNSD, CORD, and long-tailed CV benchmarks, including a $+3.24\%$ F1 gain on CORD over the prior SOTA. The findings show that large unlabeled corpora can be used to reduce annotation costs while delivering strong tail-class performance, with potential applicability beyond KIE to other multimodal, imbalanced tasks.
Abstract
Demand is growing in Key Information Extraction (KIE) for semi-supervised learning (SSL) that saves manpower and cost, because training fully supervised models on document data requires labor-intensive manual annotation. The main challenges of applying SSL to KIE are (1) underestimated confidence for tail classes under a long-tailed distribution and (2) the difficulty of achieving intra-class compactness and inter-class separability for tail features. To address these challenges, we propose a novel semi-supervised approach for KIE with Class-Rebalancing and Merged Semantic Pseudo-Labeling (CRMSP). First, the Class-Rebalancing Pseudo-Labeling (CRP) module introduces a reweighting factor to rebalance pseudo-labels, increasing attention to tail classes. Second, the Merged Semantic Pseudo-Labeling (MSP) module clusters tail features of unlabeled data by assigning samples to Merged Prototypes (MP). We additionally design a new contrastive loss specifically for MSP. Extensive experiments on three well-known benchmarks demonstrate that CRMSP achieves state-of-the-art performance. Notably, CRMSP achieves a 3.24% F1-score improvement over the previous state of the art on CORD.
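To make the idea of a reweighting factor for pseudo-labels concrete, here is a minimal sketch. It is not the CRP formulation from the paper, which the abstract does not specify. It assumes a simple inverse-frequency factor, (max_count / count_c) ** alpha, and the names class_rebalancing_weights, reweighted_cross_entropy, and alpha are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter


def class_rebalancing_weights(pseudo_labels, num_classes, alpha=1.0):
    """Per-class weights that grow as a class gets rarer among the pseudo-labels.

    Assumption (not from the paper): weight_c = (max_count / count_c) ** alpha.
    Classes with no pseudo-labels get weight 0.0.
    """
    if not pseudo_labels:
        return [0.0] * num_classes
    counts = Counter(pseudo_labels)
    max_count = max(counts.values())
    weights = []
    for c in range(num_classes):
        n_c = counts.get(c, 0)
        weights.append((max_count / n_c) ** alpha if n_c > 0 else 0.0)
    return weights


def reweighted_cross_entropy(probs, pseudo_labels, class_weights):
    """Weighted average of -log p(pseudo-label) over unlabeled samples."""
    total, norm = 0.0, 0.0
    for p, y in zip(probs, pseudo_labels):
        w = class_weights[y]
        total += -w * math.log(p[y])
        norm += w
    return total / norm


# Toy long-tailed batch: class 0 is the head, class 2 is the tail.
pseudo_labels = [0, 0, 0, 0, 1, 1, 2]
weights = class_rebalancing_weights(pseudo_labels, num_classes=3)

# Uniform predictions, used only to exercise the loss function.
probs = [[1 / 3, 1 / 3, 1 / 3] for _ in pseudo_labels]
loss = reweighted_cross_entropy(probs, pseudo_labels, weights)
```

In this toy batch the single tail-class sample receives four times the weight of each head-class sample, so its pseudo-label contributes as much to the unsupervised loss as all head-class samples combined. The exponent alpha is an illustrative knob for softening or sharpening that effect.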
