Zero-Shot Prompting and Few-Shot Fine-Tuning: Revisiting Document Image Classification Using Large Language Models
Anna Scius-Bertrand, Michael Jungo, Lars Vögtlin, Jean-Marc Spat, Andreas Fischer
TL;DR
This work revisits document image classification under extreme data scarcity on RVL-CDIP by benchmarking zero-shot prompting, few-shot fine-tuning with LoRA, embedding-based methods, OCR-free image models, and multimodal LLMs. Zero-shot GPT-4-Vision achieves up to $69.9\%$, and a generatively fine-tuned Mistral-7B reaches $72.5\%$ with $160$ samples (and $83.4\%$ with $1600$); both establish strong baselines, while embedding methods remain competitive at larger few-shot budgets. The results highlight the potential of document foundation models but also reveal limitations, particularly in OCR quality and the need for better prompts and multimodal integration. The study suggests future directions toward richer multimodal representations and self-supervised learning to further close the gap to the fully supervised performance that prior work achieves with large-scale training data ($320{,}000$ samples).
Abstract
Classifying scanned documents is a challenging problem that involves image, layout, and text analysis for document understanding. Nevertheless, for certain benchmark datasets, notably RVL-CDIP, the state of the art is closing in on near-perfect performance when hundreds of thousands of training samples are available. With the advent of large language models (LLMs), which are excellent few-shot learners, the question arises to what extent the document classification problem can be addressed with only a few training samples, or even none at all. In this paper, we investigate this question in the context of zero-shot prompting and few-shot model fine-tuning, with the aim of reducing the need for human-annotated training samples as much as possible.
