HiCI: Hierarchical Construction-Integration for Long-Context Attention

Xiangyu Zeng, Qi Xu, Yunke Wang, Chang Xu

Abstract

Long-context language modeling is commonly framed as a scalability challenge of token-level attention, yet local-to-global information structuring remains largely implicit in existing approaches. Drawing on cognitive theories of discourse comprehension, we propose HiCI (Hierarchical Construction-Integration), a hierarchical attention module that constructs segment-level representations, integrates them into a shared global context, and broadcasts both to condition segment-level attention. We validate HiCI through parameter-efficient adaptation of LLaMA-2 with fewer than 5.5% additional parameters, extending context from 4K to 100K tokens (7B) and 64K tokens (13B). Across language modeling, retrieval, and instruction-following benchmarks, HiCI yields consistent improvements over strong baselines, including matching proprietary models on topic retrieval and surpassing GPT-3.5-Turbo-16K on code comprehension. These results demonstrate the effectiveness of explicit hierarchical structuring as an inductive bias for long-context modeling.

Paper Structure (49 sections, 1 theorem, 19 equations, 5 figures, 10 tables)

Key Result

Theorem 1.1

HiCI achieves time complexity $O(TSd)$ and space complexity $O(S^2)$ per layer, linear in $T$ for fixed $S$. An additional $O((K{+}M)d)$ space is required for storing the hierarchical context, which is negligible for typical configurations ($K{+}M = 12$, $S \geq 1024$).
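The linear-in-$T$ claim can be checked with simple operation counting. The sketch below assumes that $T$ is the sequence length, that the sequence is split into $T/S$ segments of size $S$, and that each segment's queries attend to its own $S$ tokens plus a prefix of $K{+}M$ context slots; these readings are inferred from the theorem and the Figure 1 caption, not stated in the theorem itself. The function names are hypothetical and count only the multiply-adds of the attention score matrix.

```python
def segmented_score_ops(T: int, S: int, d: int, prefix: int = 12) -> int:
    """Multiply-adds for attention scores when T tokens are split into
    segments of size S, each attending to S tokens plus `prefix` slots.

    Assumption: prefix stands for K + M, the number of prepended
    hierarchical-context slots (12 in the theorem's typical configuration).
    """
    if T % S != 0:
        raise ValueError("this sketch assumes T is a multiple of S")
    n_segments = T // S
    # Each segment: S queries x (S + prefix) keys x d dimensions.
    return n_segments * S * (S + prefix) * d


def full_score_ops(T: int, d: int) -> int:
    """Multiply-adds for attention scores under full (unsegmented) attention."""
    return T * T * d


if __name__ == "__main__":
    d, S = 128, 1024
    for T in (8192, 16384, 32768):
        seg = segmented_score_ops(T, S, d)
        full = full_score_ops(T, d)
        print(f"T={T}: segmented={seg}, full={full}, ratio={full / seg:.1f}")
```

With $S$ fixed, the count is $(T/S) \cdot S \cdot (S + K + M) \cdot d = T(S + K + M)d$, which is $O(TSd)$ when $K{+}M \ll S$. Doubling $T$ doubles the segmented count but quadruples the full-attention count.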

Figures (5)

  • Figure 1: Overview of HiCI. Left: HiCI integrated into a Transformer block; trainable components are highlighted. Right: HiCI constructs hierarchical context through three stages. (1) Local Construction: the input sequence is partitioned into $N$ segments, and cross-attention with $M$ learnable query slots extracts a local representation $L_i$ from each segment. (2) Global Integration: local representations $\{L_i\}_{i=1}^N$ are aggregated into a shared global context $G$ via multi-view statistical pooling and attention-based weighting. (3) Top-down Broadcast: $G$ and $L_i$ are prepended to each segment's key-value sequence, conditioning attention on hierarchical context while preserving parallelism across segments. At inference, HiCI is optionally applied during prefill, while autoregressive decoding uses standard attention.
  • Figure 2: Passkey retrieval accuracy for LongLoRA-7B, HiCI-7B (both fine-tuned at 32K), and base LLaMA-2-7B. HiCI achieves 100% accuracy within the training length and extrapolates more gracefully to 56K via position interpolation without additional fine-tuning.
  • Figure 3: Peak GPU memory (left) and wall-clock training time (right) for HiCI and LongLoRA (LLaMA-2-7B, 8$\times$H100-80GB, 1,000 steps; Stage-2 for 8K–64K, Stage-3 for 100K). The three-stage HiCI pipeline raises memory by 3.5–9.9%, which necessitates finer partitioning at long contexts ($N{=}10$ at 100K vs. LongLoRA's $N{=}4$); the resulting quadratic reduction in per-segment attention cost yields a 19.3% wall-clock speedup.
  • Figure 4: Training loss comparison between HiCI and LongLoRA on LLaMA-2-7B continual pre-training (RedPajama, 2,000 steps). Both methods are trained at 8K and 16K context with $S \in \{1024, 2048\}$. HiCI with $S{=}1024$ sustains optimization throughout, while HiCI with $S{=}2048$ and all LongLoRA variants plateau beyond step 1,000.
  • Figure 5: Layer-wise attention allocated to global slots during evaluation on PG-19. Background shading denotes depth groups: early (L0–7, blue), middle (L8–23, white), and deep (L24–31, red). (a) Comparison of segment sizes $S{=}1024$ and $S{=}2048$ under matched conditions (8K evaluation, 2K steps): finer segmentation yields substantially higher attention to global slots, with the final layer (L31) reaching 40.4% versus 12.7%. (b) Robustness at $S{=}2048$: varying evaluation length (8K vs. 4K) and training steps (2K vs. 1K) yields nearly identical layer-wise allocation patterns, with per-layer deviations within 1 percentage point. In both panels, attention to global slots increases toward deeper layers.
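
The three stages named in the Figure 1 caption can be illustrated with a minimal single-head NumPy sketch. It is not the authors' implementation: the caption does not specify the pooling or weighting, so the mean/max/std pooling and the attention over those pooled views are stand-ins, and `hici_sketch`, `local_slots`, and `global_slots` are hypothetical names. Learned projections and multiple heads are omitted, the within-segment causal mask is an assumption, and the sketch does not address how the full method treats information flow across segments.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(q, k, v, mask=None):
    """Single-head scaled dot-product attention; q: (..., n, d), k and v: (..., m, d)."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed positions
    return softmax(scores) @ v


def hici_sketch(x, S, local_slots, global_slots):
    """x: (T, d) token states; local_slots: (M, d); global_slots: (K, d)."""
    T, d = x.shape
    if T % S != 0:
        raise ValueError("this sketch assumes T is a multiple of S")
    N, M, K = T // S, local_slots.shape[0], global_slots.shape[0]
    segs = x.reshape(N, S, d)

    # (1) Local construction: M query slots cross-attend to each segment.
    L = attend(np.broadcast_to(local_slots, (N, M, d)), segs, segs)  # (N, M, d)

    # (2) Global integration (assumed form): pool several statistics over all
    # local representations, then let K global queries weight those views.
    flat = L.reshape(N * M, d)
    views = np.stack([flat.mean(axis=0), flat.max(axis=0), flat.std(axis=0)])  # (3, d)
    G = attend(global_slots, views, views)  # (K, d)

    # (3) Top-down broadcast: prepend G and L_i to each segment's keys/values.
    prefix = np.concatenate([np.broadcast_to(G, (N, K, d)), L], axis=1)  # (N, K+M, d)
    kv = np.concatenate([prefix, segs], axis=1)  # (N, K+M+S, d)

    # Prefix slots are visible to every query; tokens are causal within a segment.
    mask = np.concatenate(
        [np.ones((S, K + M), dtype=bool), np.tril(np.ones((S, S), dtype=bool))],
        axis=1,
    )  # (S, K+M+S), broadcast over the N segments
    out = attend(segs, kv, kv, mask)  # (N, S, d), computed for all segments at once
    return out.reshape(T, d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, S, d, M, K = 32, 8, 16, 4, 2
    x = rng.normal(size=(T, d))
    y = hici_sketch(x, S, rng.normal(size=(M, d)), rng.normal(size=(K, d)))
    print(y.shape)
```

Because every segment attends to only $S + K + M$ keys and all $N$ segments are processed as one batched operation, the sketch keeps the per-segment parallelism the caption describes.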

Theorems & Definitions (3)

  • Theorem 1.1: Linear Complexity
  • Proof
  • Remark 1.2