Interpret, prune and distill Donut : towards lightweight VLMs for VQA on document

Adnan Ben Mansour, Ayoub Karine, David Naccache

TL;DR

This work tackles the resource intensity of OCR-free document VQA models by using mechanistic interpretability (MI) to guide pruning and architecture design. Starting from the Donut teacher model, the authors perform MI-based analysis to identify essential sublayers and heads, then apply a two-stage pruning process followed by knowledge distillation to train compact students. The resulting Donut-MINT variants maintain DocVQA accuracy with substantially fewer parameters and FLOPs. The key contributions are (i) a principled MI-based framework for pruning Vision-Language Models, (ii) a demonstration that MI-guided pruning outperforms brute-force pruning baselines, and (iii) a 7% parameter Donut-MINT that delivers competitive ANLS on DocVQA. This approach bridges interpretability research and practical VrDU deployment, offering a pathway toward automated, principled compression of multimodal models for real-time applications.

Abstract

Recent advances in Visually-rich Document Understanding rely on large Vision-Language Models like Donut, which perform document-level Visual Question Answering without Optical Character Recognition. Despite their effectiveness, these models are too costly for real-time or resource-constrained applications. We investigate model compression through knowledge distillation, training compact student models from a larger teacher. We leverage mechanistic interpretability to drive student architecture design within this framework. By analyzing internal computations, we identify essential subcomponents to retain, while having a clear view of which subcomponents should be approximated, skipped, or reparametrized based on their function. This approach yields Donut-MINT (Mechanistic Interpretability-based Network Trimming), a pruned Donut variant that reduces inference time and memory usage while maintaining strong performance on DocVQA, a standard benchmark for document Visual Question Answering. Our method reframes compression as circuit discovery, bridging interpretability research and practical Vision-Language Model deployment.
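
The abstract does not state which distillation objective is used. As a rough illustration only, the sketch below shows a common generic choice: a temperature-softened KL divergence between teacher and student token distributions. The function names, the temperature value, and the toy logits are assumptions for illustration, not details from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    This is a generic distillation loss (an assumption), not necessarily
    the objective used to train Donut-MINT.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(teacher, student)
        if t > 0
    )
    return kl * temperature ** 2

# Toy logits over a three-token vocabulary (hypothetical values).
teacher_logits = [4.0, 1.0, 0.5]
matching_loss = distillation_loss(teacher_logits, teacher_logits)
mismatched_loss = distillation_loss([0.5, 1.0, 4.0], teacher_logits)
print(matching_loss, mismatched_loss)
```

A student that reproduces the teacher's distribution incurs zero loss, and the loss grows as the two distributions diverge. In practice such a term is often combined with a standard cross-entropy loss on the ground-truth answer.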

Paper Structure

This paper contains 25 sections, 5 theorems, 1 equation, 7 figures, 2 tables.

Key Result

Proposition

The task of converting pixel values into discrete tokens is entirely done in the visual encoder up to a linear transformation. C3 puts attention on patches it wants to read, and heads C3.H4, C3.H7, C3.H8, C3.H9, C3.H10, and C3.H11 retrieve the token at that position.
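
One way to picture the claim is to read tokens directly off patch encodings by pushing each one through the composed linear map lm_head ∘ C3.out_proj ∘ C3.v_proj, then checking how peaked the resulting token distribution is. The sketch below does this with tiny hand-made matrices. The weights, dimensions, and helper names are hypothetical, and biases and per-head slicing are ignored for simplicity.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; low values mean a confident token readout."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def read_patch(patch, v_proj, out_proj, lm_head):
    """Apply lm_head after out_proj after v_proj to one patch encoding."""
    logits = matvec(lm_head, matvec(out_proj, matvec(v_proj, patch)))
    probs = softmax(logits)
    token = max(range(len(probs)), key=probs.__getitem__)
    return token, entropy(probs)

# Toy weights (assumptions): 2-d patch encodings, 3-token vocabulary.
identity = [[1.0, 0.0], [0.0, 1.0]]
lm_head = [[5.0, 0.0], [0.0, 5.0], [-5.0, -5.0]]

text_patch = [1.0, 0.0]   # encodes token 0 in this toy setup
blank_patch = [0.0, 0.0]  # carries no token information

text_token, text_entropy = read_patch(text_patch, identity, identity, lm_head)
blank_token, blank_entropy = read_patch(blank_patch, identity, identity, lm_head)
print(text_token, text_entropy, blank_entropy)
```

In this toy setup the text patch yields a sharply peaked distribution, while the blank patch yields a uniform one. Thresholding on entropy is how low-entropy patches could be singled out for display, as in the Figure 4 caption.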

Figures (7)

  • Figure 1: Our proposed method relies on the three stages highlighted in blue: MI-based importance estimation, pruning, and knowledge distillation. We also pruned Donut-base with other SOTA techniques to allow for a fair comparison.
  • Figure 2: Architecture overview of Donut-base. The model comprises a visual encoder and a decoder. The decoder consists of four layers, each layer $\ell \in \{0,1,2,3\}$ divided into three sub-layers: self-attention S$\ell$, cross-attention C$\ell$ bridging modalities, and feed-forward networks M$\ell$.
  • Figure 3: Impact of skipping decoder sub-layers in Donut-base measured by perplexity. Each sub-layer of the decoder is color-coded according to the increase in perplexity observed when that sub-layer is skipped, highlighting which components are critical to model performance.
  • Figure 4: (left) Projection of patch visual encodings through Swin and the linear transformation lm_head $\circ$ C3.out_proj $\circ$ C3.v_proj from the decoder into tokens. Patches with low entropy on the token distribution are colored in blue and overlaid with their most likely token. (right) Removing the least important heads leads to a clear improvement in text transcription fidelity. (bottom) A zoomed version that improves the legibility of the text.
  • Figure 5: Accuracy versus perplexity threshold for the first generated token of the M3 sub-layer after activation patching. Results are shown on both train and validation splits, plotting the fraction of samples with perplexity below each threshold under three conditions: skipping M3 entirely, patching M3 with activations from the last cross-attention times value (C3.AV), and the clean run.
  • ...and 2 more figures
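
The sub-layer skipping experiment summarized in the Figure 3 caption can be sketched on a toy residual decoder: each sub-layer adds its output to the residual stream, skipping a sub-layer leaves the stream unchanged, and importance is the resulting increase in perplexity. The sub-layer functions, their effects, and the target token below are made-up assumptions. Only the names S3, C3, and M3 follow the paper's notation.

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def perplexity(target_probs):
    """exp of the mean negative log-likelihood of the target tokens."""
    nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(nll)

def run_decoder(hidden, sublayers, skipped=frozenset()):
    """Run residual sub-layers in order, bypassing the skipped ones."""
    for name, fn in sublayers:
        if name in skipped:
            continue  # the residual stream passes through unchanged
        hidden = [h + d for h, d in zip(hidden, fn(hidden))]
    return hidden

# Toy sub-layers (assumptions): the hidden state doubles as token logits.
sublayers = [
    ("S3", lambda h: [0.0, 0.0, 0.0]),  # contributes nothing here
    ("C3", lambda h: [4.0, 0.0, 0.0]),  # strongly promotes token 0
    ("M3", lambda h: [0.5, 0.0, 0.0]),  # small refinement
]
start, target = [0.0, 0.0, 0.0], 0

def target_perplexity(skipped=frozenset()):
    probs = softmax(run_decoder(start, sublayers, skipped))
    return perplexity([probs[target]])

clean = target_perplexity()
increase = {name: target_perplexity({name}) - clean for name, _ in sublayers}
ranking = sorted(increase, key=increase.get, reverse=True)
print(ranking, increase)
```

Sub-layers whose removal barely changes perplexity are natural pruning candidates, while those with a large increase should be retained or approximated.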

Theorems & Definitions (5)

  • Proposition
  • Proposition
  • Proposition
  • Proposition
  • Proposition