Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering

Kishan Maharaj, Nandakishore Menon, Ashita Saxena, Srikanth Tamilselvam

TL;DR

The paper investigates the robustness and reasoning fidelity of large language models when answering questions over long code contexts in Python, COBOL, and Java. It introduces LongContextCodeQA, an extension of LongCodeBench, together with a robustness suite of three perturbations (option shuffling, open-ended generation, and Needle-in-a-Haystack distractors) that stress-test models across context windows of up to $1{,}000{,}000$ tokens. Key findings reveal a persistent recognition–generation gap, strong sensitivity to irrelevant information, and pronounced retrieval biases, especially in COBOL, even in frontier models and under long-context optimization. These results expose limitations in current long-context evaluations and establish a broader cross-language benchmark to guide future research toward more faithful, context-aware code reasoning in both legacy and modern software ecosystems.

Abstract

Large language models (LLMs) increasingly assist software engineering tasks that require reasoning over long code contexts, yet their robustness under varying input conditions remains unclear. We conduct a systematic study of long-context code question answering using controlled ablations that test sensitivity to answer format, distractors, and context scale. Extending the LongCodeBench Python dataset with new COBOL and Java question-answer sets, we evaluate state-of-the-art models under three settings: (i) shuffled multiple-choice options, (ii) open-ended questions, and (iii) needle-in-a-haystack contexts containing relevant and adversarially irrelevant information. Results show substantial performance drops in both shuffled multiple-choice options and open-ended questions, and brittle behavior in the presence of irrelevant cues. Our findings highlight limitations of current long-context evaluations and provide a broader benchmark for assessing code reasoning in both legacy and modern systems.
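To make the three evaluation settings concrete, the following is a minimal Python sketch of how such perturbations could be constructed. It is an illustration under stated assumptions, not the authors' implementation: the `MCQ` container and the helpers `shuffle_options`, `to_open_ended`, and `add_distractors` are hypothetical names, and placing the relevant snippet at a uniformly random position among the distractors is an assumed policy.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MCQ:
    """Hypothetical container for one multiple-choice question."""
    question: str
    options: tuple[str, ...]
    answer_index: int  # position of the correct option in `options`


def shuffle_options(item: MCQ, seed: int) -> MCQ:
    """Setting (i): permute the options and remap the correct index."""
    rng = random.Random(seed)
    order = list(range(len(item.options)))
    rng.shuffle(order)  # order[new_pos] == old_pos
    return MCQ(
        question=item.question,
        options=tuple(item.options[i] for i in order),
        answer_index=order.index(item.answer_index),
    )


def to_open_ended(item: MCQ) -> tuple[str, str]:
    """Setting (ii): drop the options; return (prompt, reference answer)."""
    return item.question, item.options[item.answer_index]


def add_distractors(relevant: str, distractors: list[str], seed: int) -> str:
    """Setting (iii): hide the relevant snippet among irrelevant ones.

    Assumption: the insertion position is drawn uniformly at random.
    """
    rng = random.Random(seed)
    chunks = list(distractors)
    chunks.insert(rng.randint(0, len(chunks)), relevant)
    return "\n\n".join(chunks)
```

Seeding each perturbation keeps the shuffled variants reproducible, so accuracy differences between the original and perturbed items can be attributed to the perturbation rather than to sampling noise.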

Paper Structure (23 sections, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Accuracy trends across context lengths for the Python dataset, comparing the "With Options" and "Without Options" settings.
  • Figure 2: Accuracy trends for the OPPSCAL dataset.
  • Figure 3: Accuracy trends for the internal IBM COBOL dataset.
  • Figure 4: Accuracy trends across context lengths for all models on LongCodeBenchQA-Java. The figure shows how performance scales from 32k to 1024k tokens for both multiple-choice and open-ended evaluation.