Intermittent Semi-Working Mask: A New Masking Paradigm for LLMs
HaoYuan Hu, Mingcong Lu, Di Luo, XinYa Wu, Jiangcai Zhu, Taoye Yin, Zheng Li, Hao Wang, Shusheng Zhang, KeZun Zhang, KaiLai Shao, Chao Chen, Feng Wang
TL;DR
The paper tackles the difficulty of maintaining high-quality, low-latency generation in long-context multi-turn dialogues by introducing Intermittent Semi-working Mask (ISM). ISM alternates bidirectional attention within each query segment and causal attention within answer segments, described mathematically by $\mathbf{x}_j \gets \mathbf{x}_j + \mathbf{O} \mathbf{V} \sum_{i=1}^{f(j)} \mathbf{x}_i (\mathbf{x}_i^\top \mathbf{K}^\top \mathbf{Q} \mathbf{x}_j)$ with a segment function $f(j)$, enabling prefix-like contextual synthesis while preserving KV-cache reuse for inference. The authors prove that ISM recovers batch-gradient-descent dynamics for the final query, achieving linear convergence to the optimum $\mathbf{w}^*$, and conserves online updates for earlier tokens. Empirically, ISM improves quality on MT-Eval, BotChat, and MATH benchmarks across LLaMA and Qwen with latency close to causal baselines, and it is architecture-agnostic and parameter-free, making it suitable for broad deployment.
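The update rule above can be sketched numerically. This is a minimal NumPy illustration, not the authors' implementation: the function name `ism_linear_attention_step`, the column-per-token layout of `X`, and the example values of `f` are assumptions made for this sketch. It assumes `f[j]` is the 1-indexed last position visible to token `j`, so `f(j) = j` gives causal attention and `f(j)` equal to the end of the query segment gives bidirectional attention inside that segment.

```python
import numpy as np

def ism_linear_attention_step(X, f, Q, K, V, O):
    """Apply x_j <- x_j + O V sum_{i=1}^{f(j)} x_i (x_i^T K^T Q x_j) to every token.

    X: (d, T) array whose columns are token vectors x_1..x_T.
    f: length-T sequence; f[j] is the (1-indexed) last position token j may see.
       This reading of the segment function is an assumption of the sketch.
    Q, K, V, O: (d, d) weight matrices.
    """
    d, T = X.shape
    X_new = X.copy()
    for j in range(T):
        ctx = X[:, : f[j]]                   # visible tokens x_1..x_{f(j)}
        scores = ctx.T @ K.T @ Q @ X[:, j]   # one scalar x_i^T K^T Q x_j per visible token
        X_new[:, j] = X[:, j] + O @ V @ (ctx @ scores)
    return X_new

# Toy example with d = 1 and identity weights (hypothetical values).
X = np.array([[1.0, 2.0]])
I = np.eye(1)
causal = ism_linear_attention_step(X, [1, 2], I, I, I, I)  # f(j) = j
bidir = ism_linear_attention_step(X, [2, 2], I, I, I, I)   # both tokens in one query segment
```

With `f = [1, 2]` the first token only sees itself; with `f = [2, 2]` it also aggregates the second token, which is the only difference between the causal and bidirectional cases in this toy setting.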
Abstract
Multi-turn dialogues and context-intensive tasks challenge Large Language Models (LLMs) to integrate long histories without sacrificing generation quality. Although prefix LLMs can better exploit historical context via bidirectional attention on prefix tokens, they are rarely used in practice because multi-turn training requires many duplicated triplets, and their bidirectional prefix prevents KV-cache reuse at inference time, driving up cost and latency. To retain the contextual understanding of the prefix mask while preserving the inference-time efficiency of the causal mask, we introduce Intermittent Semi-working Mask (ISM), a masking scheme that injects sparse bidirectional attention into the causal backbone. ISM alternates bidirectional attention over query segments with unidirectional attention over answer segments, enabling contextual synthesis within each query while preserving global causality. This design eliminates triplet expansion during training and maintains KV-cache reuse during inference, yielding latency comparable to standard causal LLMs. ISM is architecture-agnostic and parameter-free, adding only minimal latency. Across extensive evaluations, ISM outperforms causal baselines not only on multi-turn dialogue, but also on context-intensive tasks like mathematical reasoning.
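The alternating mask described in the abstract can be illustrated with a small NumPy sketch. This is one assumed reading of the scheme rather than the authors' code: the function name `ism_mask` and the `"q"` / `"a"` segment labels are hypothetical. It starts from a causal lower-triangular mask and opens full attention inside each query segment, so answer tokens and all cross-segment attention stay causal.

```python
import numpy as np

def ism_mask(segment_types, segment_lengths):
    """Build a boolean (T, T) attention mask; mask[j, i] is True if token j may attend to token i.

    segment_types: sequence of "q" (query) or "a" (answer) labels, one per segment.
    segment_lengths: number of tokens in each segment, in the same order.
    Both the labels and this interface are assumptions made for illustration.
    """
    total = sum(segment_lengths)
    # Standard causal mask: every token sees itself and all earlier tokens.
    mask = np.tril(np.ones((total, total), dtype=bool))
    start = 0
    for kind, length in zip(segment_types, segment_lengths):
        end = start + length
        if kind == "q":
            # Query segment: tokens also see later tokens of the same segment.
            mask[start:end, start:end] = True
        start = end
    return mask

# Two dialogue turns: query (2 tokens), answer (2), query (2), answer (1).
mask = ism_mask(["q", "a", "q", "a"], [2, 2, 2, 1])
print(mask.astype(int))
```

In this construction every row is a contiguous prefix of visible positions, and no token ever attends to a later segment. That is consistent with the abstract's claim that the keys and values of earlier turns can be cached and reused when a new turn arrives.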
