Harnessing the Unseen: The Hidden Influence of Intrinsic Knowledge in Long-Context Language Models
Yu Fu, Haz Sameen Shahgir, Hui Liu, Xianfeng Tang, Qi He, Yue Dong
TL;DR
This work investigates how intrinsic, parametric knowledge within long-context language models (LCMs) influences generation, particularly as input contexts become extremely long. It demonstrates a trade-off between intrinsic and extrinsic retrieval: improvements in retrieval from the external context can inadvertently suppress the model’s use of its own knowledge. The authors introduce I-WhoQA-based datasets and the Hybrid Needle-in-a-Haystack framework to jointly assess intrinsic and extrinsic retrieval, revealing that certain families (e.g., Qwen-2.5) scale intrinsic retrieval better with model size while others (e.g., Llama-3.1) struggle to utilize intrinsic knowledge under long-context conditions. The findings underscore the need for dual-retrieval evaluation of LCMs and suggest that future work should balance external context use with intrinsic knowledge expression for more reliable long-context generation.
Abstract
Recent advances in long-context models (LCMs), designed to handle extremely long input contexts, primarily focus on utilizing external contextual information, often leaving the influence of large language models' intrinsic knowledge underexplored. In this work, we investigate how this intrinsic knowledge affects content generation and demonstrate that its impact becomes increasingly pronounced as context length extends. Furthermore, we show that the model's ability to utilize intrinsic knowledge, which we call intrinsic retrieval ability, does not improve in step with its ability to leverage contextual knowledge, which we call extrinsic retrieval ability. In fact, better extrinsic retrieval can interfere with the model's ability to use its own knowledge effectively, limiting its full potential. To bridge this gap, we design a simple yet effective Hybrid Needle-in-a-Haystack test that evaluates models on both retrieval abilities, rather than solely emphasizing extrinsic retrieval ability. Our experimental results reveal that Qwen-2.5 models significantly outperform Llama-3.1 models, demonstrating superior intrinsic retrieval ability. Moreover, even the more powerful Llama-3.1-70B-Instruct model fails to exhibit better performance under LCM conditions, highlighting the importance of evaluating models from a dual-retrieval perspective.
