Harnessing the Unseen: The Hidden Influence of Intrinsic Knowledge in Long-Context Language Models

Yu Fu, Haz Sameen Shahgir, Hui Liu, Xianfeng Tang, Qi He, Yue Dong

TL;DR

This work investigates how intrinsic, parametric knowledge within long-context language models influences generation, particularly as input contexts become extremely long. It demonstrates a trade-off between intrinsic and extrinsic retrieval, showing that improvements in external context retrieval can inadvertently suppress the model’s use of its own knowledge. The authors introduce I-WhoQA-based datasets and the Hybrid Needle-in-a-Haystack framework to jointly assess intrinsic and extrinsic retrieval, revealing that certain families (e.g., Qwen2.5) better scale intrinsic retrieval with model size while others (e.g., Llama3.1) struggle to utilize intrinsic knowledge under long-context conditions. The findings underscore the need for dual-retrieval evaluation in LCMs and suggest that future work should balance external context use with intrinsic knowledge expression for more reliable long-context generation.

Abstract

Recent advances in long-context models (LCMs), designed to handle extremely long input contexts, primarily focus on utilizing external contextual information, often leaving the influence of large language models' intrinsic knowledge underexplored. In this work, we investigate how this intrinsic knowledge affects content generation and demonstrate that its impact becomes increasingly pronounced as context length extends. Furthermore, we show that the model's ability to utilize intrinsic knowledge, which we call intrinsic retrieval ability, does not improve in step with its extrinsic retrieval ability, i.e., its ability to leverage contextual knowledge. Moreover, better extrinsic retrieval can interfere with the model's ability to use its own knowledge effectively, limiting its full potential. To bridge this gap, we design a simple yet effective Hybrid Needle-in-a-Haystack test that evaluates models on both retrieval abilities, rather than solely emphasizing extrinsic retrieval ability. Our experimental results reveal that Qwen-2.5 models significantly outperform Llama-3.1 models, demonstrating superior intrinsic retrieval ability. Moreover, even the more powerful Llama-3.1-70B-Instruct model fails to exhibit better performance under LCM conditions, highlighting the importance of evaluating models from a dual-retrieval perspective.

Paper Structure

This paper contains 23 sections, 1 equation, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Left) LCMs struggle to retrieve the answer from the context on the I-WhoQA-Conflict subset, i.e., when the context information conflicts with their intrinsic knowledge. Right) The upward trend of I-WhoQA-conflict (red) shows that when the intrinsic knowledge conflicts with the context, the likelihood of an LCM relying on intrinsic knowledge steadily increases as the context grows.
  • Figure 2: I-WhoQA-irrelevant subsets: LCMs are often unable to ignore context that is not relevant to the question and recall the answer from their intrinsic knowledge. As the context length increases, LCMs gradually start ignoring the context and generating according to their intrinsic knowledge. However, an enhanced positional encoding designed for long-context tasks, such as STRING, causes LCMs to re-focus on the irrelevant context and generate wrong answers.
  • Figure 3: Performance of LCMs on HotPotQA-context and -internal subsets. Left) STRING improves performance on the HotPotQA-context subset by enhancing extrinsic retrieval ability. Right) STRING hinders the model's ability to recall intrinsic knowledge, which lowers performance.
  • Figure 4: Upper) Needle-in-a-Haystack. It involves directly inserting the answer into the haystack and retrieving it. Lower) Hybrid Needle-in-a-Haystack. It requires a two-step process: first, performing intrinsic retrieval to identify the retrieval target based on the model’s intrinsic knowledge, and then retrieving the answer from the haystack.
  • Figure 5: Hybrid NIAH test results. Upper) Qwen2.5-7B-Instruct-1M with generation length 32. Lower) Qwen2.5-72B-Instruct with generation lengths 32 and 64. The number of random facts was set to 0.
  • ...and 6 more figures
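
The two-step structure described in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation: the needle and question templates, the function names (`insert_needle`, `build_standard_niah`, `build_hybrid_niah`), and the Paris example are all hypothetical, chosen only to show how a hybrid probe might differ from a standard one. In the standard case the question names the needle's key directly; in the hybrid case the question refers to the key only through a description, so the model would first have to resolve the description from its intrinsic knowledge before it can locate the needle.

```python
from typing import List, Tuple


def insert_needle(sentences: List[str], needle: str, depth: float) -> List[str]:
    """Return a copy of `sentences` with `needle` inserted at a relative depth.

    depth = 0.0 places the needle first, depth = 1.0 places it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    idx = int(depth * len(sentences))
    return sentences[:idx] + [needle] + sentences[idx:]


def build_standard_niah(
    filler: List[str], key: str, value: str, depth: float
) -> Tuple[str, str]:
    """Standard probe: the question names the key exactly as the needle does."""
    # Hypothetical template, not taken from the paper.
    needle = f"The secret code for {key} is {value}."
    context = " ".join(insert_needle(filler, needle, depth))
    question = f"What is the secret code for {key}?"
    return context, question


def build_hybrid_niah(
    filler: List[str], entity: str, description: str, value: str, depth: float
) -> Tuple[str, str]:
    """Hybrid probe: the needle names the entity, the question only describes it.

    Answering requires (1) intrinsic retrieval to map `description` to `entity`
    and (2) extrinsic retrieval to find the needle about `entity` in the context.
    """
    # Hypothetical template, not taken from the paper.
    needle = f"The secret code for {entity} is {value}."
    context = " ".join(insert_needle(filler, needle, depth))
    question = f"What is the secret code for {description}?"
    return context, question


if __name__ == "__main__":
    filler = ["Sentence one.", "Sentence two.", "Sentence three.", "Sentence four."]
    context, question = build_hybrid_niah(
        filler, "Paris", "the capital of France", "7421", depth=0.5
    )
    print(context)
    print(question)
```

Under these assumptions, sweeping `depth` and the amount of filler text would give the usual depth-by-length grid, while comparing the standard and hybrid probes on the same grid would separate extrinsic retrieval failures from intrinsic ones.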