HDT: Hierarchical Document Transformer

Haoyu He, Markus Flicke, Jan Buchmann, Iryna Gurevych, Andreas Geiger

TL;DR

This work tackles the challenge of efficiently encoding long, structured documents with Transformers by introducing the Hierarchical Document Transformer (HDT). HDT uses auxiliary anchor tokens to represent document structure (DOC, SEC, SENT) and a hierarchical, sparse attention pattern that allows information to flow across levels while maintaining computational efficiency, implemented via a GPU-optimized kernel with per-sample sparsity. The approach yields faster pretraining convergence and improved performance on a range of downstream tasks, including mathematical reasoning, scientific document understanding, summarization, QA, and NLI, compared to strong baselines like Longformer and HAT. Empirical results across ListOps, SciRepEval, FacetSum, and SCROLLS, along with efficiency analyses, demonstrate the practical impact of explicitly leveraging document structure for long-document modeling and highlight future directions in hierarchical decoding and scaling behavior.

Abstract

In this paper, we propose the Hierarchical Document Transformer (HDT), a novel sparse Transformer architecture tailored for structured hierarchical documents. Such documents are extremely important in numerous domains, including science, law or medicine. However, most existing solutions are inefficient and fail to make use of the structure inherent to documents. HDT exploits document structure by introducing auxiliary anchor tokens and redesigning the attention mechanism into a sparse multi-level hierarchy. This approach facilitates information exchange between tokens at different levels while maintaining sparsity, thereby enhancing computational and memory efficiency while exploiting the document structure as an inductive bias. We address the technical challenge of implementing HDT's sample-dependent hierarchical attention pattern by developing a novel sparse attention kernel that considers the hierarchical structure of documents. As demonstrated by our experiments, utilizing structural information present in documents leads to faster convergence, higher sample efficiency and better performance on downstream tasks.
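The sparse multi-level attention described above can be illustrated with a small sketch. It assumes the rule given in the paper's figure captions (each token attends to itself, its parent, its siblings and its children) and builds a dense boolean mask in pure Python. The nested-list document format, the `[DOC]`/`[SEC]`/`[SENT]` labels, and the function names `flatten` and `hierarchical_mask` are hypothetical choices for illustration. This is not the authors' GPU kernel, which avoids materializing the full mask.

```python
def flatten(document):
    """Flatten sections -> sentences -> tokens into one sequence.

    Returns (labels, parents), where parents[i] is the index of the
    anchor token directly above position i (None for the document anchor).
    """
    labels, parents = ["[DOC]"], [None]
    for section in document:
        sec = len(labels)
        labels.append("[SEC]")
        parents.append(0)          # section anchors hang off the document anchor
        for sentence in section:
            sent = len(labels)
            labels.append("[SENT]")
            parents.append(sec)    # sentence anchors hang off their section anchor
            for token in sentence:
                labels.append(token)
                parents.append(sent)  # regular tokens hang off their sentence anchor
    return labels, parents


def hierarchical_mask(parents):
    """mask[i][j] is True when position i may attend to position j."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            same = i == j
            siblings = parents[i] is not None and parents[i] == parents[j]
            parent_child = parents[i] == j or parents[j] == i
            mask[i][j] = same or siblings or parent_child
    return mask


# Hypothetical document: two sections; the first has two sentences.
document = [[["T1", "T2"], ["T3"]], [["T4"]]]
labels, parents = flatten(document)
mask = hierarchical_mask(parents)

for label, row in zip(labels, mask):
    print(f"{label:>6} " + "".join("x" if allowed else "." for allowed in row))
```

In this toy layout, T1 and T2 share a sentence and attend to each other directly, while T1 and T3 sit in different sentences and can exchange information only through the sentence and section anchors over several layers.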

Paper Structure (14 sections, 5 equations, 11 figures, 9 tables, 1 algorithm)

This paper contains 14 sections, 5 equations, 11 figures, 9 tables, 1 algorithm.

Figures (11)

  • Figure 1: We propose a sparse attention kernel that considers the hierarchical structure of documents (model architecture panel). Here, regular tokens are illustrated in green, and auxiliary anchor tokens in yellow (document), red (section) and blue (sentence). Each token attends to its parent, siblings and children. Cross-level attention is illustrated using color gradients in the attention matrix. Utilizing structural information present in documents leads to faster pre-training (convergence panel) and better performance on downstream tasks. The MLM loss in the convergence panel is calculated on the held-out validation set.
  • Figure 2: Hierarchical Document Decomposition. Left: Tree representation of a document. Tokens within the same box attend to each other. Tokens that do not share a box attend to each other only indirectly (e.g., T1 and T3 via the sentence and section tokens). Right: Sparse attention matrix.
  • Figure 3: Hierarchical Positional Encoding. We represent the position of each token in the hierarchy with one linear index $p^l$ per hierarchy level $l$ yielding an index vector $\mathbf{p}=(p^1,\dots,p^L)^T$. Above, we show an example with $L=3$ levels. Note that level 0 (document) does not require an index. Each index in $\mathbf{p}$ is passed through sinusoidal encoding functions which are summed over all levels to form the final encoding vector according to the paper's positional encoding equation.
  • Figure 4: Hierarchical Attention Kernel. We copy queries, keys and values block-wise to SRAM for fast attention computation using a fused kernel. To increase the number of empty blocks that can be skipped, we sort keys and values based on their hierarchy level. Larger examples are shown in the paper's full attention mask figure.
  • Figure 5: ListOps Sample
  • ...and 6 more figures
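
As a rough illustration of the hierarchical positional encoding described in the Figure 3 caption, the sketch below encodes each per-level index $p^l$ with a sinusoidal function and sums the results over the levels. The paper's exact equation is not reproduced on this page, so the standard sinusoidal form with base 10000, the function names `sinusoidal` and `hierarchical_encoding`, and the example index vector are all assumptions.

```python
import math


def sinusoidal(position, dim, base=10000.0):
    """Standard sinusoidal encoding of one integer index (assumed form)."""
    enc = []
    for k in range(0, dim, 2):
        angle = position / (base ** (k / dim))
        enc.append(math.sin(angle))
        enc.append(math.cos(angle))
    return enc[:dim]  # trim the extra entry when dim is odd


def hierarchical_encoding(index_vector, dim):
    """Sum the sinusoidal encodings of the per-level indices p^1..p^L."""
    total = [0.0] * dim
    for p in index_vector:
        for d, value in enumerate(sinusoidal(p, dim)):
            total[d] += value
    return total


# Hypothetical index vector: section 2, sentence 1, token 3 (L = 3 levels).
encoding = hierarchical_encoding((2, 1, 3), dim=8)
print([round(v, 3) for v in encoding])
```

A plain sum with frequencies shared across levels cannot tell permuted index vectors apart (for example, (2, 1, 3) and (1, 2, 3) give the same result), so the paper's actual equation may treat the levels differently.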