Intermittent Semi-Working Mask: A New Masking Paradigm for LLMs

HaoYuan Hu, Mingcong Lu, Di Luo, XinYa Wu, Jiangcai Zhu, Taoye Yin, Zheng Li, Hao Wang, Shusheng Zhang, KeZun Zhang, KaiLai Shao, Chao Chen, Feng Wang

TL;DR

The paper addresses the difficulty of maintaining high-quality, low-latency generation in long-context, multi-turn dialogues by introducing the Intermittent Semi-working Mask (ISM). ISM applies bidirectional attention within each query segment and causal attention within each answer segment. Mathematically, the update is $\mathbf{x}_j \gets \mathbf{x}_j + \mathbf{O} \mathbf{V} \sum_{i=1}^{f(j)} \mathbf{x}_i (\mathbf{x}_i^\top \mathbf{K}^\top \mathbf{Q} \mathbf{x}_j)$, where the segment function $f(j)$ gives the last position that token $j$ may attend to. This enables prefix-like contextual synthesis while preserving KV-cache reuse at inference. The authors prove that ISM recovers batch-gradient-descent dynamics for the final query, achieving linear convergence to the optimum $\mathbf{w}^*$, and retains online updates for earlier tokens. Empirically, ISM improves quality on the MT-Eval, BotChat, and MATH benchmarks across LLaMA and Qwen models with latency close to that of causal baselines. It is architecture-agnostic and parameter-free, making it suitable for broad deployment.
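To make the role of the segment function concrete, here is a minimal NumPy sketch of how an ISM-style attention mask could be built from $f(j)$. The summary does not spell out $f(j)$, so the definition used here is an assumption: a query token sees up to the end of its own query segment, and an answer token sees only up to itself. The helper names `segment_function` and `ism_mask` and the `(kind, length)` segment encoding are hypothetical, and positions are 0-indexed rather than 1-indexed as in the formula.

```python
import numpy as np

def segment_function(segments):
    """Return f as an array: f[j] is the last position token j may attend to.

    `segments` is a list of (kind, length) pairs with kind "q" (query) or "a" (answer).
    Assumption (not stated in the source): query tokens see to the end of their
    own query segment; answer tokens see only up to themselves.
    """
    f = []
    start = 0
    for kind, length in segments:
        end = start + length - 1
        for j in range(start, start + length):
            f.append(end if kind == "q" else j)
        start += length
    return np.array(f, dtype=int)

def ism_mask(segments):
    """Boolean mask M where M[j, i] is True iff token j attends to token i, i.e. i <= f(j)."""
    f = segment_function(segments)
    positions = np.arange(len(f))
    return positions[None, :] <= f[:, None]

# Two dialogue turns: query(2 tokens), answer(2), query(2), answer(1).
segments = [("q", 2), ("a", 2), ("q", 2), ("a", 1)]
f = segment_function(segments)
mask = ism_mask(segments)
print(f)
print(mask.astype(int))
```

Under this assumed $f(j)$, the mask equals the usual lower-triangular causal mask plus a fully filled square block on each query segment, which is one way to read "sparse bidirectional attention injected into a causal backbone".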

Abstract

Multi-turn dialogues and context-intensive tasks challenge Large Language Models (LLMs) to integrate long histories without sacrificing generation quality. Although prefix LLMs can better exploit historical context via bidirectional attention on prefix tokens, they are rarely used in practice because multi-turn training requires many duplicated triplets, and their bidirectional prefix prevents KV-cache reuse at inference time, driving up cost and latency. To retain the contextual understanding of the prefix mask while preserving the inference-time efficiency of the causal mask, we introduce Intermittent Semi-working Mask (ISM), a masking scheme that injects sparse bidirectional attention into the causal backbone. ISM alternates bidirectional attention over query segments with unidirectional attention over answer segments, enabling in-context synthesis while preserving global causality. This design eliminates triplet expansion during training and maintains KV-cache reuse during inference, yielding latency comparable to standard causal LLMs. ISM is architecture-agnostic and parameter-free, adding only minimal latency. Across extensive evaluations, ISM outperforms causal baselines not only on multi-turn dialogue, but also on context-intensive tasks like mathematical reasoning.
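The KV-cache argument in the abstract can be illustrated with a small mask-level check. The sketch below is not the authors' implementation. It assumes that an ISM-style mask is causal everywhere except for a bidirectional block on each query segment, and that a prefix mask treats everything before the final answer as one bidirectional prefix. The function names and the `(kind, length)` segment encoding are hypothetical. The check asks whether appending a new turn changes the attention rows of tokens that were already processed; unchanged rows are a necessary condition for their cached keys and values to stay valid.

```python
import numpy as np

def ism_mask(segments):
    # M[j, i] is True iff token j may attend to token i.
    # Assumption: bidirectional inside each query segment, causal elsewhere.
    n = sum(length for _, length in segments)
    mask = np.tril(np.ones((n, n), dtype=bool))
    start = 0
    for kind, length in segments:
        if kind == "q":
            mask[start:start + length, start:start + length] = True
        start += length
    return mask

def prefix_mask(segments):
    # Assumption: everything before the final answer segment forms one
    # bidirectional prefix; the final answer is decoded causally.
    n = sum(length for _, length in segments)
    mask = np.tril(np.ones((n, n), dtype=bool))
    prefix_len = n - segments[-1][1] if segments[-1][0] == "a" else n
    mask[:prefix_len, :prefix_len] = True
    return mask

def rows_preserved(mask_fn, history, new_turn):
    # True if appending new_turn leaves the attention rows of old tokens unchanged.
    old = mask_fn(history)
    new = mask_fn(history + new_turn)
    n = old.shape[0]
    return bool(np.array_equal(new[:n, :n], old) and not new[:n, n:].any())

history = [("q", 3), ("a", 2)]    # first turn
new_turn = [("q", 2), ("a", 2)]   # second turn appended later
ism_ok = rows_preserved(ism_mask, history, new_turn)
prefix_ok = rows_preserved(prefix_mask, history, new_turn)
print(ism_ok, prefix_ok)  # prints: True False
```

With the prefix mask, the first-turn answer tokens become part of the bidirectional prefix once a second turn arrives, so their attention pattern changes and earlier computation cannot simply be reused. With the ISM-style mask, every old row is untouched. This check only concerns the mask structure; it says nothing about the measured latency.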

Paper Structure (14 sections, 3 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Illustration of our Intermittent Semi-working Mask vs. existing Causal Mask and Prefix Mask.
  • Figure 2: Comparison between ISM and ChatGLM when facing multi-turn dialogue data. [M]:=[MASK], [S]:=[START].
  • Figure 3: The TTFT and TPOT latency comparison between LLaMA3.1-8B(SFT) and LLaMA3.1-8B(ISM).
  • Figure 4: Two cases generated from LLaMA3.1-8B(SFT) and LLaMA3.1-8B(ISM), with GPT-4 judgements.