Shakespearean Sparks: The Dance of Hallucination and Creativity in LLMs' Decoding Layers
Zicong He, Boxuan Zhang, Lu Cheng
TL;DR
The paper tackles the challenge of balancing creativity and hallucination in large language models by introducing Hallucination-Creativity Across Layers (HCL), a framework that quantifies both traits layer by layer using Layer-Skip sampling. It defines a narrow, task-specific notion of creativity and pairs it with a hallucination metric to compute a Hallucination-Creativity Balanced (HCB) score, which makes it possible to identify the optimal decoding layer for different model architectures. Across multiple LLaMA-based models and two QA datasets (TriviaQA and Natural Questions), the study finds a consistent trade-off: higher creativity often accompanies more hallucination, larger models amplify both effects, and the optimal layer typically appears at an earlier depth rather than at the final layer. The results suggest that early-exit decoding can strike a favorable balance between factual accuracy and creative diversity in open-domain QA. The authors release code and data on GitHub to support replication and further exploration of layer-wise decoding strategies.
Abstract
Large language models (LLMs) are known to hallucinate, a phenomenon often linked to creativity. While previous research has primarily explored this connection through theoretical or qualitative lenses, our work takes a quantitative approach to systematically examine the relationship between hallucination and creativity in LLMs. Given the complex nature of creativity, we propose a narrow definition tailored to LLMs and introduce an evaluation framework, HCL, which quantifies Hallucination and Creativity across different Layers of LLMs during decoding. Our empirical analysis reveals a tradeoff between hallucination and creativity that is consistent across layer depth, model type, and model size. Notably, across different model architectures, we identify a specific layer at each model size that optimally balances this tradeoff. Additionally, the optimal layer tends to appear in the early layers of larger models, and the confidence of the model is also significantly higher at this layer. These findings provide a quantitative perspective that offers new insights into the interplay between LLM creativity and hallucination. The code and data for our experiments are available at https://github.com/ZicongHe2002/HCL-Spark.
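The idea of picking one layer that balances the tradeoff can be illustrated with a minimal sketch. This is not the authors' HCB formula, which the abstract does not spell out. It assumes a hypothetical balanced score, a weighted creativity term minus a weighted hallucination term, and the names `LayerStats`, `balanced_score`, and `best_layer` as well as the numbers are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerStats:
    """Per-layer measurements (hypothetical container, not from the paper)."""
    layer: int            # index of the decoding (early-exit) layer
    creativity: float     # creativity rate in [0, 1], higher is better
    hallucination: float  # hallucination rate in [0, 1], lower is better


def balanced_score(stats: LayerStats, weight: float = 0.5) -> float:
    """ASSUMED score: reward creativity, penalize hallucination.

    `weight` sets how much creativity counts relative to hallucination.
    The paper's actual HCB score may be defined differently.
    """
    return weight * stats.creativity - (1.0 - weight) * stats.hallucination


def best_layer(per_layer: list[LayerStats], weight: float = 0.5) -> LayerStats:
    """Return the layer whose assumed balanced score is highest."""
    if not per_layer:
        raise ValueError("per_layer must contain at least one entry")
    return max(per_layer, key=lambda s: balanced_score(s, weight))


# Illustrative numbers only; they are not results from the paper.
example = [
    LayerStats(layer=8, creativity=0.20, hallucination=0.10),
    LayerStats(layer=16, creativity=0.50, hallucination=0.20),
    LayerStats(layer=24, creativity=0.55, hallucination=0.45),
    LayerStats(layer=32, creativity=0.60, hallucination=0.60),
]

if __name__ == "__main__":
    chosen = best_layer(example)
    print(f"selected layer: {chosen.layer}")
```

With these made-up numbers, creativity keeps rising toward the final layer but hallucination rises faster, so the assumed score peaks at an intermediate layer rather than at the last one, which mirrors the qualitative pattern the abstract describes.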
