Table of Contents
Fetching ...

Zero-Shot Prompting and Few-Shot Fine-Tuning: Revisiting Document Image Classification Using Large Language Models

Anna Scius-Bertrand, Michael Jungo, Lars Vögtlin, Jean-Marc Spat, Andreas Fischer

TL;DR

This work revisits document image classification under extreme data scarcity on RVL-CDIP by benchmarking zero-shot prompting, few-shot fine-tuning with LoRA, embedding-based methods, OCR-free image models, and multi-modal LLMs. It finds that zero-shot GPT-4-Vision achieving up to $69.9\%$ and a generatively fine-tuned Mistral-7B reaching $72.5\%$ with $160$ samples (and $83.4\%$ with $1600$) establish strong baselines, while embedding methods remain competitive at larger few-shot budgets. The results highlight the potential of document foundation models but also reveal limitations, particularly in OCR quality and the need for better prompts and multimodal integration. The study suggests future directions toward richer multimodal representations and self-supervised learning to further close the gap to fully supervised performance on large-scale data ($320{,}000$ samples) as in prior work.

Abstract

Classifying scanned documents is a challenging problem that involves image, layout, and text analysis for document understanding. Nevertheless, for certain benchmark datasets, notably RVL-CDIP, the state of the art is closing in to near-perfect performance when considering hundreds of thousands of training samples. With the advent of large language models (LLMs), which are excellent few-shot learners, the question arises to what extent the document classification problem can be addressed with only a few training samples, or even none at all. In this paper, we investigate this question in the context of zero-shot prompting and few-shot model fine-tuning, with the aim of reducing the need for human-annotated training samples as much as possible.

Zero-Shot Prompting and Few-Shot Fine-Tuning: Revisiting Document Image Classification Using Large Language Models

TL;DR

This work revisits document image classification under extreme data scarcity on RVL-CDIP by benchmarking zero-shot prompting, few-shot fine-tuning with LoRA, embedding-based methods, OCR-free image models, and multi-modal LLMs. It finds that zero-shot GPT-4-Vision achieving up to and a generatively fine-tuned Mistral-7B reaching with samples (and with ) establish strong baselines, while embedding methods remain competitive at larger few-shot budgets. The results highlight the potential of document foundation models but also reveal limitations, particularly in OCR quality and the need for better prompts and multimodal integration. The study suggests future directions toward richer multimodal representations and self-supervised learning to further close the gap to fully supervised performance on large-scale data ( samples) as in prior work.

Abstract

Classifying scanned documents is a challenging problem that involves image, layout, and text analysis for document understanding. Nevertheless, for certain benchmark datasets, notably RVL-CDIP, the state of the art is closing in to near-perfect performance when considering hundreds of thousands of training samples. With the advent of large language models (LLMs), which are excellent few-shot learners, the question arises to what extent the document classification problem can be addressed with only a few training samples, or even none at all. In this paper, we investigate this question in the context of zero-shot prompting and few-shot model fine-tuning, with the aim of reducing the need for human-annotated training samples as much as possible.

Paper Structure

This paper contains 29 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Example document images from the RVL-CDIP dataset.
  • Figure 2: Three system prompts considered for document classification. Note that the list of 16 categories has been shortened in the figure, indicated by […], to save space.
  • Figure 3: Embedding of the $1\,600$ training samples using the OpenAI-large model, visualized with t-SNE.