Table of Contents
Fetching ...

Domain-specific long text classification from sparse relevant information

Célia D'Cruz, Jean-Marc Bereder, Frédéric Precioso, Michel Riveill

TL;DR

This work tackles long-domain medical document classification where critical information is sparsely distributed. It introduces a hierarchical model that filters documents using a compact target-term vocabulary to extract sentence-level context, then aggregates target-term embeddings with attention to form a document representation for classification. The approach is evaluated on a public English smoking-status benchmark and a private French colorectal cancer dataset, showing clear advantages over baselines including long-range transformers and other hierarchical models, especially under high sparsity and with robust vocabulary design. A second loss with expert-refined filtering and an active-learning path further improves performance, highlighting the value of human-in-the-loop refinement in domain-specific NLP. The results suggest practical benefits for healthcare decision support, achieved with smaller, more tractable models that respect patient privacy and computational constraints.

Abstract

Large Language Models have undoubtedly revolutionized the Natural Language Processing field, the current trend being to promote one-model-for-all tasks (sentiment analysis, translation, etc.). However, the statistical mechanisms at work in the larger language models struggle to exploit the relevant information when it is very sparse, when it is a weak signal. This is the case, for example, for the classification of long domain-specific documents, when the relevance relies on a single relevant word or on very few relevant words from technical jargon. In the medical domain, it is essential to determine whether a given report contains critical information about a patient's condition. This critical information is often based on one or few specific isolated terms. In this paper, we propose a hierarchical model which exploits a short list of potential target terms to retrieve candidate sentences and represent them into the contextualized embedding of the target term(s) they contain. A pooling of the term(s) embedding(s) entails the document representation to be classified. We evaluate our model on one public medical document benchmark in English and on one private French medical dataset. We show that our narrower hierarchical model is better than larger language models for retrieving relevant long documents in a domain-specific context.

Domain-specific long text classification from sparse relevant information

TL;DR

This work tackles long-domain medical document classification where critical information is sparsely distributed. It introduces a hierarchical model that filters documents using a compact target-term vocabulary to extract sentence-level context, then aggregates target-term embeddings with attention to form a document representation for classification. The approach is evaluated on a public English smoking-status benchmark and a private French colorectal cancer dataset, showing clear advantages over baselines including long-range transformers and other hierarchical models, especially under high sparsity and with robust vocabulary design. A second loss with expert-refined filtering and an active-learning path further improves performance, highlighting the value of human-in-the-loop refinement in domain-specific NLP. The results suggest practical benefits for healthcare decision support, achieved with smaller, more tractable models that respect patient privacy and computational constraints.

Abstract

Large Language Models have undoubtedly revolutionized the Natural Language Processing field, the current trend being to promote one-model-for-all tasks (sentiment analysis, translation, etc.). However, the statistical mechanisms at work in the larger language models struggle to exploit the relevant information when it is very sparse, when it is a weak signal. This is the case, for example, for the classification of long domain-specific documents, when the relevance relies on a single relevant word or on very few relevant words from technical jargon. In the medical domain, it is essential to determine whether a given report contains critical information about a patient's condition. This critical information is often based on one or few specific isolated terms. In this paper, we propose a hierarchical model which exploits a short list of potential target terms to retrieve candidate sentences and represent them into the contextualized embedding of the target term(s) they contain. A pooling of the term(s) embedding(s) entails the document representation to be classified. We evaluate our model on one public medical document benchmark in English and on one private French medical dataset. We show that our narrower hierarchical model is better than larger language models for retrieving relevant long documents in a domain-specific context.
Paper Structure (26 sections, 1 figure, 7 tables)