Out-of-Distribution Detection with Attention Head Masking for Multimodal Document Classification

Christos Constantinou, Georgios Ioannides, Aman Chadha, Aaron Elkins, Edwin Simpson

TL;DR

The paper tackles out-of-distribution (OOD) detection for multimodal document classification by introducing Attention Head Masking (AHM), a post-training, inference-time technique that masks a subset of transformer attention heads so that the resulting embeddings separate in-distribution (ID) and OOD samples more clearly for distance-based detectors. It also releases FinanceDocs, a high-quality digital multi-modal document dataset for benchmarking OOD methods in this domain. Empirically, AHM improves AUROC and lowers FPR across kNN, Mahalanobis, and ensemble variants on Tobacco3482 and FinanceDocs, with strong cross-dataset performance. The work shows that masking attention heads at inference time can yield robust OOD detection without any change to training, which is a practical benefit for reliable document classification in real-world settings.

Abstract

Detecting out-of-distribution (OOD) data is crucial in machine learning applications to mitigate the risk of model overconfidence, thereby enhancing the reliability and safety of deployed systems. Most existing OOD detection methods address uni-modal inputs, such as images or text. For multi-modal documents, there is a notable lack of extensive research on how these methods perform, since they were developed primarily for computer vision tasks. We propose a novel methodology termed attention head masking (AHM) for multi-modal OOD tasks in document classification systems. Our empirical results demonstrate that the proposed AHM method outperforms all state-of-the-art approaches and decreases the false positive rate (FPR) by up to 7.5% compared to existing solutions. This methodology generalizes well to multi-modal data, such as documents, where visual and textual information are modeled under the same Transformer architecture. To address the scarcity of high-quality publicly available document datasets and encourage further research on OOD detection for documents, we introduce FinanceDocs, a new document AI dataset. Our code and dataset are publicly available.
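
To make the distance-based detection and FPR terminology concrete, here is a minimal sketch of a kNN OOD score and an FPR-at-fixed-TPR computation. It is not the authors' implementation. The function names, the L2 normalization, the choice of k, the 95% TPR operating point, the convention that ID is the positive class, and the synthetic embeddings are all assumptions made for illustration.

```python
import numpy as np

def knn_ood_score(train_emb, test_emb, k=5):
    """Distance from each test embedding to its k-th nearest ID training embedding.

    Larger scores suggest the sample is more likely OOD.
    Embeddings are L2-normalized first (an assumption, common in kNN detectors).
    """
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    test = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    # Pairwise Euclidean distances, shape (num_test, num_train).
    dists = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=-1)
    return np.sort(dists, axis=1)[:, k - 1]

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted as ID when `tpr` of ID samples are accepted.

    Assumes ID is the positive class and low scores mean "looks ID".
    """
    threshold = np.quantile(id_scores, tpr)
    return float(np.mean(ood_scores <= threshold))

# Synthetic stand-ins for model embeddings (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
dim = 16
id_mean = np.zeros(dim)
id_mean[0] = 5.0
ood_mean = np.zeros(dim)
ood_mean[1] = 5.0

train_emb = rng.normal(loc=id_mean, scale=0.3, size=(200, dim))
id_test_emb = rng.normal(loc=id_mean, scale=0.3, size=(50, dim))
ood_test_emb = rng.normal(loc=ood_mean, scale=0.3, size=(50, dim))

id_scores = knn_ood_score(train_emb, id_test_emb)
ood_scores = knn_ood_score(train_emb, ood_test_emb)
fpr = fpr_at_tpr(id_scores, ood_scores)
print(f"mean ID score={id_scores.mean():.3f}, mean OOD score={ood_scores.mean():.3f}, FPR@95={fpr:.3f}")
```

In this framing, a method such as AHM would change only the embeddings passed to `knn_ood_score`; the detector and the metric stay the same.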

Paper Structure

This paper contains 16 sections, 11 figures, 3 tables, 1 algorithm.

Figures (11)

  • Figure 1: Visual demonstration of AHM on a transformer-based model: For each attention layer, we utilize the corresponding attention head mask from the AHM matrix. Following query-key multiplication and the subsequent softmax operation, the resulting attention scores undergo element-wise multiplication with the relevant attention head mask. This process effectively reduces the attention scores of certain heads to zero, thereby inhibiting the propagation of their respective information through the value matrix.
  • Figure 2: Examples of SEC form documents.
  • Figure 3: Examples of shareholder letter documents.
  • Figure 4: Examples of SEC letter documents.
  • Figure 5: Examples of SEC-13 form documents.
  • ...and 6 more figures
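
The masking step described in the Figure 1 caption can be sketched in NumPy as follows. This is an illustrative reading of the caption, not the authors' code. The function names, the tensor shapes, the scaling by the square root of the head dimension, and the particular heads set to zero in `ahm_matrix` are assumptions; how the paper selects which heads to mask is not covered here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_multihead_attention(q, k, v, head_mask):
    """Attention for one layer with per-head masking.

    q, k, v: arrays of shape (heads, tokens, head_dim).
    head_mask: array of shape (heads,), 1.0 keeps a head, 0.0 masks it.
    """
    head_dim = q.shape[-1]
    # Query-key multiplication followed by softmax: (heads, tokens, tokens).
    scores = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))
    # Element-wise multiplication with the layer's head mask.
    scores = scores * head_mask[:, None, None]
    # A masked head has all-zero scores, so nothing flows through its values.
    return scores @ v

rng = np.random.default_rng(0)
num_layers, num_heads, tokens, head_dim = 2, 4, 5, 8

# Hypothetical AHM matrix: one row of head masks per attention layer.
ahm_matrix = np.ones((num_layers, num_heads))
ahm_matrix[0, 1] = 0.0  # mask head 1 in layer 0 (arbitrary choice)
ahm_matrix[1, 3] = 0.0  # mask head 3 in layer 1 (arbitrary choice)

q, k, v = (rng.normal(size=(num_heads, tokens, head_dim)) for _ in range(3))
out = masked_multihead_attention(q, k, v, ahm_matrix[0])
unmasked = masked_multihead_attention(q, k, v, np.ones(num_heads))
print(out.shape, np.abs(out[1]).max())
```

Because the mask is applied after the softmax, a masked head contributes exactly zero to the layer output, while the remaining heads are left unchanged.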