Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models

Jaemin Son, Sujin Choi, Inyong Yun

TL;DR

A lightweight token pruning framework that filters out non-informative background regions from document images before VLM processing, substantially lowering computational cost while maintaining comparable accuracy.

Abstract

Recent progress in vision-language models (VLMs) has led to impressive results in document understanding tasks, but their high computational demands remain a challenge. To reduce this computational burden, we propose a lightweight token pruning framework that filters out non-informative background regions from document images prior to VLM processing. A binary patch-level classifier removes non-text areas, and a max-pooling refinement step recovers fragmented text regions to enhance spatial coherence. Experiments on real-world document datasets demonstrate that our approach substantially lowers computational costs while maintaining comparable accuracy.
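
The two steps named in the abstract, max-pooling refinement of a binary patch mask followed by pruning that keeps each surviving token's original index, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `max_pool_mask` and `prune_tokens`, the 3x3 kernel, the 4x6 patch grid, and the token dimension of 8 are all assumptions chosen for illustration, and the classifier output is replaced by a hand-written mask.

```python
import numpy as np


def max_pool_mask(mask: np.ndarray, kernel: int = 3) -> np.ndarray:
    """Stride-1 max-pool over a binary patch mask (odd kernel, zero padding).

    A patch is kept if any patch in its kernel window was classified as text,
    which fills small gaps between fragmented text patches.
    The kernel size is an assumed value, not taken from the paper.
    """
    pad = kernel // 2
    padded = np.pad(mask.astype(bool), pad, mode="constant", constant_values=False)
    h, w = mask.shape
    out = np.zeros((h, w), dtype=bool)
    # OR together every shifted view of the padded mask inside the window.
    for dy in range(kernel):
        for dx in range(kernel):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def prune_tokens(tokens: np.ndarray, mask: np.ndarray):
    """Keep tokens of retained patches and return their original flat indices.

    Returning the indices lets a downstream model keep using each token's
    original grid position instead of a re-numbered one.
    """
    keep_idx = np.flatnonzero(mask.ravel())
    return tokens[keep_idx], keep_idx


# Hypothetical 4x6 patch grid: two text patches separated by a one-patch gap.
raw = np.zeros((4, 6), dtype=bool)
raw[1, 1] = True
raw[1, 3] = True

refined = max_pool_mask(raw, kernel=3)

# Hypothetical patch tokens, one 8-dimensional vector per patch.
tokens = np.arange(raw.size * 8, dtype=np.float32).reshape(raw.size, 8)
kept, keep_idx = prune_tokens(tokens, refined)

# Original (row, col) position of every kept token.
rows, cols = np.unravel_index(keep_idx, raw.shape)
```

In this toy grid the gap patch at row 1, column 2 is recovered by the pooling step, and `keep_idx` still refers to positions in the full 4x6 grid, so the pruned sequence can be paired with the positional information of the unpruned image.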

Paper Structure

This paper contains 17 sections, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Illustration of the Index-Preserving Lightweight Token Pruning framework. The proposed framework consists of three components: a binary text-region classifier (green), index-preserving token pruning (blue), and a frozen off-the-shelf VLM (gray). Text-region patches are selected and fed into the off-the-shelf VLM.
  • Figure 2: Text recovery with max-pooling.
  • Figure 3: Visual token and TFLOPs statistics. Metrics are averaged over all images in each dataset. TFLOPs are counted end-to-end including the computation of the text-region classifier.