Mixture of In-Context Experts Enhance LLMs' Long Context Awareness

Hongzhan Lin, Ang Lv, Yuhan Chen, Chen Zhu, Yang Song, Hengshu Zhu, Rui Yan

TL;DR

When applied to open-source LLMs including Llama and Mistral, MoICE surpasses prior methods across multiple tasks on long context understanding and generation, all while maintaining commendable inference efficiency.

Abstract

Many studies have revealed that large language models (LLMs) exhibit uneven awareness of different contextual positions. This limited context awareness can cause them to overlook critical information and subsequently fail the task. While several approaches have been proposed to enhance LLMs' context awareness, achieving both effectiveness and efficiency remains challenging. In this paper, for LLMs utilizing RoPE as position embeddings, we introduce a novel method called "Mixture of In-Context Experts" (MoICE) to address this challenge. MoICE comprises two key components: a router integrated into each attention head within LLMs, and a lightweight router-only training strategy. (1) MoICE views each RoPE angle as an "in-context" expert, which is demonstrated to be capable of directing the attention of a head to specific contextual positions. Consequently, each attention head flexibly processes tokens using multiple RoPE angles, dynamically selected by the router, to attend to the needed positions. This approach mitigates the risk of overlooking essential contextual information. (2) The router-only training strategy freezes the LLM parameters and updates only the routers, for just a few steps. When applied to open-source LLMs including Llama and Mistral, MoICE surpasses prior methods across multiple tasks on long context understanding and generation, all while maintaining commendable inference efficiency.
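
The routing idea in the abstract can be illustrated with a minimal, pure-Python sketch of a single attention head. It is not the authors' implementation. The names `route` and `mixed_scores`, the linear router, the example base values, and the choice to mix pre-softmax attention logits are all assumptions made for illustration; the abstract states only that a per-head router selects several RoPE angles for each token.

```python
import math
import random

def rope_angles(head_dim, base):
    """Standard RoPE frequencies theta_i = base^(-2i/d) for one base value."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def apply_rope(vec, position, angles):
    """Rotate consecutive pairs of `vec` by position * theta_i."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(position * theta), math.sin(position * theta)
        out += [x * c - y * s, x * s + y * c]
    return out

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(query, router_weights, k):
    """Hypothetical linear router: pick the top-k experts for one token.

    Returns {expert_index: weight}, with weights normalised over the k picks.
    """
    logits = [sum(w * q for w, q in zip(row, query)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda j: logits[j], reverse=True)[:k]
    return dict(zip(top, softmax([logits[j] for j in top])))

def mixed_scores(queries, keys, bases, router_weights, k):
    """Causal attention logits for one head, mixed over the selected angles.

    Assumption: the mixture is applied to pre-softmax logits.
    """
    d = len(queries[0])
    angle_sets = [rope_angles(d, b) for b in bases]
    scores = []
    for m, q in enumerate(queries):
        row = [0.0] * (m + 1)
        for j, w in route(q, router_weights, k).items():
            q_rot = apply_rope(q, m, angle_sets[j])
            for n in range(m + 1):
                k_rot = apply_rope(keys[n], n, angle_sets[j])
                row[n] += w * sum(a * b for a, b in zip(q_rot, k_rot))
        scores.append([s / math.sqrt(d) for s in row])
    return scores

# Toy usage with N=3 experts and K=2 selected per token.
random.seed(0)
d, seq_len = 8, 5
bases = [10000.0, 17500.0, 25000.0]  # illustrative values, not from the paper
router_weights = [[random.gauss(0, 1) for _ in range(d)] for _ in bases]
queries = [[random.gauss(0, 1) for _ in range(d)] for _ in range(seq_len)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(seq_len)]
scores = mixed_scores(queries, keys, bases, router_weights, k=2)
```

In this sketch only `router_weights` would be trained, mirroring the router-only strategy: the query and key vectors stand in for outputs of frozen LLM projections. Rotating consecutive pairs follows the original RoPE formulation; some implementations pair the first and second halves of the vector instead.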

Paper Structure (31 sections, 12 equations, 6 figures, 11 tables)

Figures (6)

  • Figure 1: Some methods developed to enhance LLMs' context awareness. (a) Attention Buckets (Chen et al., 2023) selects $N$ different RoPE angles and conducts $N$ parallel inferences for each input. The outputs are then aggregated in the final layer. (b) Ms-PoE (Zhang et al., 2024) employs a unique RoPE angle for each attention head. However, it needs an additional forward pass for RoPE angle assignment. (c) MoICE integrates a router within each attention head. This novel plug-in selects several of the most suitable RoPE angles for each token. The selected RoPE angles collectively contribute to computing the attention scores. MoICE demonstrates superior memory efficiency and performance.
  • Figure 2: Different $\Theta_j$ alter the upper bounds of attention scores between a token and its $x$-distance neighbors. Each angle is distinguished by its own base value $B_j$.
  • Figure 3: The structure of MoICE. Only the router's parameters are trainable when plugged into an LLM. For clarity, the figure illustrates a single head, with $N$=3 and $K$=2 as toy demonstration examples.
  • Figure 4: The routing weights across two distinct attention heads at the 27th layer in Llama-2-7B-chat. The input tokens are randomly sampled from the training data, and the attention heads under observation are also randomly selected. The horizontal axis depicts the input tokens, while the vertical axis represents experts with varying RoPE angles. Due to their distinct functions, each head dynamically chooses different experts to process individual tokens. The input text can be found in Figure 5.
  • Figure 5: The input text for Figure 4. For clarity, we show only part of the input text; the text with a yellow background corresponds to the decoded tokens.
  • ...and 1 more figure