DistilDoc: Knowledge Distillation for Visually-Rich Document Applications

Jordy Van Landeghem, Subhajit Maity, Ayan Banerjee, Matthew Blaschko, Marie-Francine Moens, Josep Lladós, Sanket Biswas

TL;DR

DistilDoc investigates knowledge distillation (KD) as a path to efficient visually-rich document understanding (DU). It compresses backbones for document layout analysis (DLA) and document image classification (DIC), and it enriches downstream DocVQA prompts with logical layout information. It presents a KD benchmarking framework covering three architectures (ResNet, ViT, DiT) and six KD methods (Vanilla, NKD, MSE, FitNet, ReviewKD, SimKD), evaluated on datasets such as RVL-CDIP, DocLayNet, RVL-CDIP-N, Tobacco-3482, and PRImA, plus zero-shot DocVQA with LLMs. Key findings include a persistent teacher-student gap in DLA (roughly 8–10%), strong and consistent performance of SimKD for several backbones, and mixed results for DiT-based transfers, with DiT sometimes offering out-of-distribution (OOD) robustness. The work also shows that DLA-enriched prompting can modestly boost zero-shot DocVQA performance, which motivates further work on layout-aware prompting and on richer, more diverse datasets. The KD benchmarking framework is released as open source to guide future KD research in DU and to support efficient, layout-aware models in real-world document understanding pipelines.

Abstract

This work explores knowledge distillation (KD) for visually-rich document (VRD) applications such as document layout analysis (DLA) and document image classification (DIC). While VRD research is dependent on increasingly sophisticated and cumbersome models, the field has neglected to study efficiency via model compression. Here, we design a KD experimentation methodology for more lean, performant models on document understanding (DU) tasks that are integral within larger task pipelines. We carefully selected KD strategies (response-based, feature-based) for distilling knowledge to and from backbones with different architectures (ResNet, ViT, DiT) and capacities (base, small, tiny). We study what affects the teacher-student knowledge gap and find that some methods (tuned vanilla KD, MSE, SimKD with an apt projector) can consistently outperform supervised student training. Furthermore, we design downstream task setups to evaluate covariate shift and the robustness of distilled DLA models on zero-shot layout-aware document visual question answering (DocVQA). DLA-KD experiments result in a large mAP knowledge gap, which unpredictably translates to downstream robustness, accentuating the need to further explore how to efficiently obtain more semantic document layout awareness.
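
As a rough illustration of the response-based strategies named in the abstract (vanilla KD and MSE), the following minimal sketch computes both losses for a single example in plain Python. It follows the standard textbook formulation of vanilla KD (temperature-softened KL divergence combined with cross-entropy) and is not taken from the paper's released framework. The function names, the temperature of 4.0, and the weight `alpha` of 0.5 are illustrative assumptions, not the settings used by the authors.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stable)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Supervised loss: negative log-probability of the ground-truth class."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def vanilla_kd_loss(student_logits, teacher_logits, label,
                    temperature=4.0, alpha=0.5):
    """Vanilla (response-based) KD in its standard form.

    temperature and alpha are hypothetical defaults chosen for illustration.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # The T^2 factor keeps the distillation gradient scale comparable
    # to the supervised term when the temperature changes.
    distill = temperature ** 2 * kl_divergence(p_teacher, p_student)
    supervised = cross_entropy(student_logits, label)
    return alpha * supervised + (1 - alpha) * distill


def mse_logit_loss(student_logits, teacher_logits):
    """MSE variant: match the teacher's raw logits directly."""
    n = len(student_logits)
    return sum((s - t) ** 2 for s, t in zip(student_logits, teacher_logits)) / n
```

When the student reproduces the teacher's logits exactly, the distillation term vanishes and only the weighted supervised term remains, which is a quick sanity check for any implementation of this loss.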

Paper Structure

This paper contains 41 sections, 5 equations, 3 figures, 19 tables, 1 algorithm.

Figures (3)

  • Figure 1: DistilDoc presents the first framework to investigate the potential of KD-based DLA model compression to enrich LLM prompts with logical layout structure to practically and efficiently improve downstream applications such as DocVQA.
  • Figure 2: Proposed experimental methodology to comprehensively study all aspects (left-to-right) that impact KD methods (response, feature; projectors) adapted for VDU task specifics (architecture, weight initialization, pretraining & finetuning datasets, student capacity). Downstream setups evaluate the robustness of distilled students.