LaFiCMIL: Rethinking Large File Classification from the Perspective of Correlated Multiple Instance Learning

Tiezhu Sun, Weiguo Pian, Nadia Daoudi, Kevin Allix, Tegawendé F. Bissyandé, Jacques Klein

TL;DR

The paper tackles the problem of classifying very long texts under Transformer input limits by introducing LaFiCMIL, a Correlated Multiple Instance Learning approach that treats a large document as a bag of correlated chunks. It combines BERT-based chunk embeddings with a Nyström-based LaFiAttention mechanism and a learnable category vector to model inter-chunk correlations, enabling efficient single-GPU training and inference. Across seven public benchmarks, including datasets dominated by long documents, LaFiCMIL achieves state-of-the-art accuracy, with a notable improvement on the Paired Book Summary dataset, and supports sequences of up to ~20k tokens. The approach pairs strong performance with practical resource requirements, offering a scalable solution for large-file classification, and the authors provide code and data to the community.
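As a rough illustration of the bag-of-chunks view described above, the sketch below splits a token sequence into consecutive, non-overlapping chunks. The function name `split_into_chunks`, the chunk size of 510 tokens (BERT's 512-position limit minus its two special tokens), and the non-overlapping scheme are assumptions made for illustration; this summary does not state the chunking settings the authors actually use.

```python
from typing import List


def split_into_chunks(token_ids: List[int], chunk_size: int = 510) -> List[List[int]]:
    """Split a token sequence into consecutive, non-overlapping chunks.

    Hypothetical helper: each chunk plays the role of one instance in the bag.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]


# Stand-in for a tokenized document of 20,000 tokens (ids are placeholders).
doc = list(range(20_000))
bag = split_into_chunks(doc, chunk_size=510)

# Number of chunks, size of the first chunk, size of the last chunk.
print(len(bag), len(bag[0]), len(bag[-1]))  # prints: 40 510 110
```

Under these assumptions, a document near the ~20k-token limit becomes a bag of about 40 instances, which is the sequence length the attention layer has to handle rather than 20,000.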

Abstract

Transformer-based models have significantly advanced natural language processing, in particular the performance in text classification tasks. Nevertheless, these models face challenges in processing large files, primarily due to their input constraints, which are generally restricted to hundreds or thousands of tokens. Attempts to address this issue in existing models usually consist in extracting only a fraction of the essential information from lengthy inputs, while often incurring high computational costs due to their complex architectures. In this work, we address the challenge of classifying large files from the perspective of correlated multiple instance learning. We introduce LaFiCMIL, a method specifically designed for large file classification. LaFiCMIL is optimized for efficient operation on a single GPU, making it a versatile solution for binary, multi-class, and multi-label classification tasks. We conducted extensive experiments using seven diverse and comprehensive benchmark datasets to assess LaFiCMIL's effectiveness. By integrating BERT for feature extraction, LaFiCMIL demonstrates exceptional performance, setting new benchmarks across all datasets. A notable achievement of our approach is its ability to scale BERT to handle nearly 20,000 tokens while operating on a single GPU with 32GB of memory. This efficiency, coupled with its state-of-the-art performance, highlights LaFiCMIL's potential as a groundbreaking approach in the field of large file classification.

Paper Structure

This paper contains 14 sections, 3 theorems, 16 equations, 1 figure, and 6 tables.

Key Result

Theorem 1

Suppose $S : \chi \rightarrow \mathbb{R}$ is a continuous set function w.r.t. the Hausdorff distance $d_{H}(\cdot, \cdot)$ [rote1991computing]. $\forall \varepsilon > 0$, for any invertible map $P : \chi \rightarrow \mathbb{R}^{n}$, there exist functions $\sigma$ and $g$ such that for any set $X \in \chi$:

Figures (1)

  • Figure 1: LaFiCMIL. Initially, document chunks are transformed into embedding vectors using BERT. A learnable category vector is then concatenated to these embeddings to form an augmented bag $X_i^0$ with $n' = n + 1$ instances. The LaFiAttention layer captures the inter-instance correlations within $X_i^0$. Operations within this layer, such as matrix multiplication ($\times$) and addition ($+$), are specified alongside the variable names and matrix dimensions. Key processes include sMEANS for landmark selection, similar to [shen2018baseline], pINV for pseudoinverse approximation, and DConv for depth-wise convolution. Classification is completed by passing the learned category vector through a fully connected layer.
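The data flow in the Figure 1 caption can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: random vectors stand in for BERT chunk embeddings and for the learned parameters, `np.linalg.pinv` computes an exact pseudoinverse where the caption describes an approximation (pINV), the depth-wise convolution (DConv) branch is omitted, and all function names, dimensions, and the class count are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def segment_means(x, m):
    """Landmarks as means of m consecutive row segments (needs n % m == 0)."""
    n, d = x.shape
    if n % m != 0:
        raise ValueError("number of rows must be divisible by m")
    return x.reshape(m, n // m, d).mean(axis=1)


def nystrom_attention(q, k, v, m):
    """Nystrom-style approximation of softmax(q k^T / sqrt(d)) v with m landmarks."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    q_l = segment_means(q, m)           # (m, d) landmark queries
    k_l = segment_means(k, m)           # (m, d) landmark keys
    f = softmax(q @ k_l.T * scale)      # (n, m)
    a = softmax(q_l @ k_l.T * scale)    # (m, m)
    b = softmax(q_l @ k.T * scale)      # (m, n)
    # Exact pseudoinverse here; an iterative approximation is an alternative.
    return f @ np.linalg.pinv(a) @ (b @ v)


rng = np.random.default_rng(0)
n_chunks, d, m, num_classes = 39, 16, 8, 3   # hypothetical sizes

chunk_emb = rng.normal(size=(n_chunks, d))   # stand-in for BERT chunk embeddings
category = rng.normal(size=(1, d))           # stand-in for the learnable category vector
bag = np.concatenate([category, chunk_emb])  # augmented bag with n' = n + 1 rows

# Random projections stand in for learned query/key/value weights.
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
q, k, v = bag @ w_q, bag @ w_k, bag @ w_v

out = nystrom_attention(q, k, v, m)          # (n', d)

# Classification reads only the category row, via a fully connected layer.
w_cls = rng.normal(size=(d, num_classes))
logits = out[0] @ w_cls
print(out.shape, logits.shape)
```

The three small softmax matrices have sizes $n' \times m$, $m \times m$, and $m \times n'$, so the cost grows linearly in the number of chunks for a fixed number of landmarks $m$, instead of quadratically as with the full $n' \times n'$ attention matrix.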

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 2
  • Lemma