HDT: Hierarchical Document Transformer
Haoyu He, Markus Flicke, Jan Buchmann, Iryna Gurevych, Andreas Geiger
TL;DR
This work addresses efficient encoding of long, structured documents with Transformers by introducing the Hierarchical Document Transformer (HDT). HDT represents document structure with auxiliary anchor tokens (DOC, SEC, SENT) and uses a hierarchical, sparse attention pattern that lets information flow across levels while keeping computation efficient; the pattern is implemented via a GPU-optimized kernel that supports per-sample sparsity. Compared to strong baselines such as Longformer and HAT, the approach yields faster pretraining convergence and improved performance on downstream tasks including mathematical reasoning, scientific document understanding, summarization, QA, and NLI. Results on ListOps, SciRepEval, FacetSum, and SCROLLS, together with efficiency analyses, show the practical benefit of explicitly leveraging document structure for long-document modeling, and the work points to hierarchical decoding and scaling behavior as future directions.
Abstract
In this paper, we propose the Hierarchical Document Transformer (HDT), a novel sparse Transformer architecture tailored for structured hierarchical documents. Such documents are extremely important in numerous domains, including science, law or medicine. However, most existing solutions are inefficient and fail to make use of the structure inherent to documents. HDT exploits document structure by introducing auxiliary anchor tokens and redesigning the attention mechanism into a sparse multi-level hierarchy. This approach facilitates information exchange between tokens at different levels while maintaining sparsity, thereby enhancing computational and memory efficiency while exploiting the document structure as an inductive bias. We address the technical challenge of implementing HDT's sample-dependent hierarchical attention pattern by developing a novel sparse attention kernel that considers the hierarchical structure of documents. As demonstrated by our experiments, utilizing structural information present in documents leads to faster convergence, higher sample efficiency and better performance on downstream tasks.
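
To make the idea of anchor tokens and a sparse multi-level attention pattern more concrete, the following is a minimal sketch of how such a sample-dependent mask could be built for a single document. It is not the paper's kernel. The connectivity rule is an assumption for illustration: words attend within their own sentence together with that sentence's SENT anchor, SENT anchors attend within their section together with the SEC anchor, and SEC anchors attend to each other and to the DOC anchor. The names `build_tokens` and `may_attend` are hypothetical.

```python
# Illustrative sketch only: a dense boolean mask for a hierarchical attention
# pattern. The rule below is an assumed reading of the abstract, not the
# authors' implementation (which uses a custom sparse GPU kernel).

DOC, SEC, SENT, WORD = "DOC", "SEC", "SENT", "WORD"


def build_tokens(doc):
    """Flatten a nested document (sections -> sentences -> words) into
    (kind, section_index, sentence_index, text) tuples, inserting one
    anchor token per document, section, and sentence."""
    tokens = [(DOC, -1, -1, "[DOC]")]
    for s, section in enumerate(doc):
        tokens.append((SEC, s, -1, "[SEC]"))
        for t, sentence in enumerate(section):
            tokens.append((SENT, s, t, "[SENT]"))
            for word in sentence:
                tokens.append((WORD, s, t, word))
    return tokens


def may_attend(a, b):
    """Return True if tokens a and b share a level of the assumed hierarchy."""
    kind_a, sec_a, sent_a, _ = a
    kind_b, sec_b, sent_b, _ = b
    kinds = {kind_a, kind_b}
    # Sentence level: words and the SENT anchor of the same sentence.
    if kinds <= {SENT, WORD} and (sec_a, sent_a) == (sec_b, sent_b):
        return True
    # Section level: SENT anchors and the SEC anchor of the same section.
    if kinds <= {SEC, SENT} and sec_a == sec_b:
        return True
    # Document level: all SEC anchors and the DOC anchor.
    if kinds <= {DOC, SEC}:
        return True
    return False


# Toy document: two sections; the first has two sentences, the second has one.
doc = [[["a", "b"], ["c"]], [["d", "e", "f"]]]
tokens = build_tokens(doc)
mask = [[may_attend(q, k) for k in tokens] for q in tokens]

allowed = sum(sum(row) for row in mask)
print(f"{allowed} of {len(tokens) ** 2} query-key pairs are attended")
```

Because the mask depends on where each document's sections and sentences fall, it differs from sample to sample. A dense mask like this one still costs quadratic memory, which is why a dedicated sparse kernel is needed to turn the sparsity into actual savings.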
