ICL Optimized Fragility
Serena Gomez Wannaz
TL;DR
The paper investigates how in-context learning (ICL) guides modulate cross-domain reasoning in GPT-OSS:20b, revealing a phenomenon the author calls optimized fragility. In a six-configuration prompting study (one baseline and five ICL variants) covering 840 tests across general knowledge questions, logic riddles, and a Math Olympiad problem, ICL guides raise efficiency and accuracy on knowledge tasks (91–99%) but degrade performance on complex reasoning (10–43% accuracy on riddles, against 43% for the baseline). No significant difference appears on the Olympiad problem ($p=0.2173$). ANOVA confirms significant behavioral differences across configurations ($p<0.001$), indicating that ICL guides impose systematic heuristics rather than merely memorizing data. The findings highlight trade-offs for LLM deployment and safety: prompt design can shift reasoning strategies in domain-specific ways, sometimes at the cost of general reasoning flexibility.
Abstract
In-context learning (ICL) guides are known to improve task-specific performance, but their impact on cross-domain cognitive abilities remains unexplored. This study examines how ICL guides affect reasoning across different knowledge domains using six variants of the GPT-OSS:20b model: one baseline model and five ICL configurations (simple, chain-of-thought, random, appended text, and symbolic language). The models were subjected to 840 tests spanning general knowledge questions, logic riddles, and a Math Olympiad problem. Statistical analysis (ANOVA) revealed significant behavioral modifications ($p<0.001$) across ICL variants, demonstrating a phenomenon termed "optimized fragility." ICL models achieved 91–99% accuracy on general knowledge tasks while showing degraded performance on complex reasoning problems, with accuracy on riddles ranging from 10% to 43%, compared to 43% for the baseline model. Notably, no significant differences emerged on the Olympiad problem ($p=0.2173$), suggesting that complex mathematical reasoning remains unaffected by ICL optimization. These findings indicate that ICL guides create systematic trade-offs between efficiency and reasoning flexibility, with important implications for LLM deployment and AI safety.
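As a rough illustration of the kind of comparison the abstract describes, the sketch below computes a one-way ANOVA F statistic across prompt configurations in plain Python. The function name `one_way_anova_f`, the configuration labels, and the 0/1 scores are hypothetical placeholders chosen for illustration; they are not the study's data, and the paper's actual analysis pipeline may differ.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of scores per configuration.
    Assumes at least two groups and nonzero within-group variance.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical per-question scores (1 = correct, 0 = incorrect).
# These values are made up for illustration only.
scores = {
    "baseline": [1, 0, 1, 0, 1, 0, 0, 1],
    "simple": [0, 0, 1, 0, 0, 0, 0, 1],
    "chain_of_thought": [0, 0, 0, 0, 1, 0, 0, 0],
}

f_stat, df_b, df_w = one_way_anova_f(list(scores.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

The F statistic is then compared against the F distribution with the reported degrees of freedom to obtain a p-value; a statistics library would normally handle that step.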
