Evaluating Large Language Models in Vulnerability Detection Under Variable Context Windows

Jie Lin, David Mohaisen

TL;DR

This study investigates how tokenized Java code length affects vulnerability detection across ten large language models, using a unified Byte Pair Encoding tokenization and chi-square analysis. By evaluating restricted and extended context windows, the researchers quantify the accuracy and explicitness of vulnerability judgments and find that robustness is model-dependent: GPT-4 is consistently insensitive to input length, while other models show varying susceptibility. The work shows that increasing model size alone does not guarantee length-agnostic performance, and it suggests that longer context windows or preprocessing techniques that reduce token counts can improve reliability. These findings inform the design and deployment of LLM-based vulnerability detection tools in software security, where model selection and input preprocessing both matter for stable performance.

Abstract

This study examines the impact of tokenized Java code length on the accuracy and explicitness of ten major LLMs in vulnerability detection. Using chi-square tests and known ground truth, we found inconsistencies across models: some, like GPT-4, Mistral, and Mixtral, showed robustness, while others exhibited a significant link between tokenized length and performance. We recommend future LLM development focus on minimizing the influence of input length for better vulnerability detection. Additionally, preprocessing techniques that reduce token count while preserving code structure could enhance LLM accuracy and explicitness in these tasks.
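To illustrate the kind of test the abstract refers to, the following is a minimal sketch of a chi-square test of independence between tokenized-length bin and detection correctness for a single model. It is not the authors' code: the three length bins, the counts in `observed`, and the function name `chi_square_independence` are all hypothetical, chosen only to show how the statistic is formed.

```python
def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Expected count if length bin and correctness were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed_count - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts for one model (NOT data from the paper).
# Rows: tokenized-length bins; columns: [correct, incorrect] judgments.
observed = [
    [80, 20],  # short inputs
    [70, 30],  # medium inputs
    [50, 50],  # long inputs
]

statistic, dof = chi_square_independence(observed)

# Standard critical value of the chi-square distribution for
# 2 degrees of freedom at a significance level of 0.05.
CRITICAL_VALUE_DOF2_ALPHA_05 = 5.991
length_matters = statistic > CRITICAL_VALUE_DOF2_ALPHA_05

print(f"chi-square = {statistic:.2f}, dof = {dof}, significant = {length_matters}")
```

With these illustrative counts the statistic is about 21.0 on 2 degrees of freedom, well above the 0.05 critical value, so this hypothetical model's correctness would be judged dependent on input length. A model described as robust would instead yield a statistic below the critical value.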

Paper Structure

This paper contains 11 sections and 2 tables.