Multimodal Adaptive Inference for Document Image Classification with Anytime Early Exiting

Omar Hamed, Souhail Bakkali, Marie-Francine Moens, Matthew Blaschko, Jordy Van Landeghem

TL;DR

The paper tackles the efficiency bottleneck of multimodal document understanding by introducing a multimodal anytime early-exit (EE) architecture built on LayoutLMv3_Base. It systematically explores exit placements, exit classifiers (gates vs ramps), training strategies (including weighting and entropy regularization), and inference policies (global, multi-exit, and a novel heuristic) to achieve Pareto-efficient accuracy-latency trade-offs. Calibrating exit confidences substantially improves exiting decisions, and a proposed heuristic thresholding method demonstrates competitive performance without extra tuning. Empirically, the approach yields over 20% latency reduction while largely preserving baseline accuracy on RVL-CDIP, highlighting practical potential for scalable, adaptive VDU deployments and laying groundwork for integrating complementary efficiency techniques in future work.

Abstract

This work addresses the need for a balanced approach between performance and efficiency in scalable production environments for visually-rich document understanding (VDU) tasks. Current approaches rely on large document foundation models that offer advanced capabilities but come with a heavy computational burden. In this paper, we propose a multimodal early exit (EE) model design that incorporates various training strategies, exit layer types, and exit placements. Our goal is to achieve a Pareto-optimal balance between predictive performance and efficiency for multimodal document image classification. Through a comprehensive set of experiments, we compare our approach with traditional exit policies and showcase an improved performance-efficiency trade-off. Our multimodal EE design preserves the model's predictive capabilities while improving inference speed: it reduces latency by over 20% while fully retaining the baseline accuracy. This research represents the first exploration of multimodal EE design within the VDU community, and it also highlights the effectiveness of calibration in improving confidence scores for exiting at different layers. Overall, our findings contribute to practical VDU applications by enhancing both performance and efficiency.
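
The early-exit idea summarized above can be illustrated with a minimal sketch. It assumes a single global confidence threshold and an optional per-exit temperature for calibration (temperature scaling is one common calibration method, swapped in here for illustration; the summary does not say which method the paper uses). The function names `softmax` and `early_exit_predict`, and all numeric values, are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Optional, Sequence, Tuple


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with an optional calibration temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(
    exit_logits: Sequence[Sequence[float]],
    threshold: float,
    temperatures: Optional[Sequence[float]] = None,
) -> Tuple[int, int]:
    """Return (predicted_class, exit_index) under a global threshold policy.

    exit_logits[i] holds the logits of the i-th exit, ordered from the
    shallowest exit to the final classifier. In a real model the deeper
    layers would only be computed if the shallower exits decline to answer;
    precomputed logits are used here purely for illustration.
    """
    if temperatures is None:
        temperatures = [1.0] * len(exit_logits)
    last = len(exit_logits) - 1
    for i, (logits, temp) in enumerate(zip(exit_logits, temperatures)):
        probs = softmax(logits, temp)
        confidence = max(probs)
        # Exit as soon as the exit is confident enough; the final
        # classifier always answers.
        if confidence >= threshold or i == last:
            return probs.index(confidence), i
    raise ValueError("exit_logits must contain at least one exit")


# Hypothetical logits for three exits on a 3-class problem.
example_logits = [[0.1, 0.2, 0.0], [3.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
prediction, exit_used = early_exit_predict(example_logits, threshold=0.8)
```

Raising the threshold pushes more inputs to deeper exits (higher latency, usually higher accuracy), while a temperature above 1 softens an overconfident exit so that it fires less often. This is the kind of accuracy-latency knob that the exit policies compared in the paper tune.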

Paper Structure (25 sections, 10 equations, 7 figures)

Figures (7)

  • Figure 1: Illustration of the proposed experimental methodology of a multi-modal multi-exit architecture for efficient document image classification. Every step highlights design choices that are benchmarked for achieving Pareto efficiency.
  • Figure 2: Performance-efficiency trade-offs for comparing models when both criteria are important. Any point on the Pareto frontier is considered Pareto-efficient relative to all points below the frontier.
  • Figure 3: Effect of calibration for Ramp exits and training strategy.
  • Figure 4: Effect of calibration for Gate exits and training strategy.
  • Figure 5: A comparison between the different exit policies, varying calibration.
  • ...and 2 more figures
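
The notion of Pareto efficiency used in the Figure 2 caption can be made concrete with a small sketch. It assumes each model configuration is summarized as a (latency, accuracy) pair where lower latency and higher accuracy are both preferred; the function name `pareto_frontier` and the sample points are hypothetical and do not reproduce any results from the paper.

```python
from typing import List, Tuple


def pareto_frontier(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the Pareto-efficient (latency, accuracy) points.

    A point is kept only if no other point is at least as fast and
    strictly more accurate, or strictly faster and at least as accurate.
    """
    frontier: List[Tuple[float, float]] = []
    best_accuracy = float("-inf")
    # Sweep from fastest to slowest; on latency ties, visit the most
    # accurate point first so the weaker tie is discarded.
    for latency, accuracy in sorted(points, key=lambda p: (p[0], -p[1])):
        if accuracy > best_accuracy:
            frontier.append((latency, accuracy))
            best_accuracy = accuracy
    return frontier


# Made-up (relative latency, accuracy) pairs for illustration only.
candidates = [(1.0, 0.80), (0.8, 0.78), (0.9, 0.75), (0.6, 0.70), (1.0, 0.79)]
efficient = pareto_frontier(candidates)
```

In this made-up example, the configuration at (0.9, 0.75) is dominated by (0.8, 0.78), which is both faster and more accurate, so it falls below the frontier.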