Repairing Regex Vulnerabilities via Localization-Guided Instructions

Sicheol Sung, Joonghyuk Hahn, Yo-Sub Han

TL;DR

The paper tackles ReDoS vulnerabilities in regexes by introducing Localized Regex Repair (LRR), a hybrid framework that localizes the vulnerability with a symbolic module and then uses an LLM to generate a semantically equivalent repair for the targeted sub-pattern. By decoupling localization from repair, LRR leverages precise pattern localization and the generalization capabilities of LLMs to handle complex cases that rule-based methods miss, achieving a reported improvement of up to 15.40 percentage points in repair rate over prior approaches. The evaluation, conducted on a 1000-regex polyglot dataset and multiple LLMs, reveals a trade-off between syntactic alterations and semantic fidelity, with reasoning LLMs often delivering higher semantic safety at the cost of longer repairs. The work demonstrates a practical, scalable path toward automated, reliable regex repair, while acknowledging limitations in the localization heuristic and the empirical nature of invulnerability validation.

Abstract

Regular expressions (regexes) are foundational to modern computing for critical tasks like input validation and data parsing, yet their ubiquity exposes systems to regular expression denial of service (ReDoS), a vulnerability requiring automated repair methods. Current approaches, however, are hampered by a trade-off. Symbolic, rule-based systems are precise but fail to repair unseen or complex vulnerability patterns. Conversely, large language models (LLMs) possess the necessary generalizability but are unreliable for tasks demanding strict syntactic and semantic correctness. We resolve this impasse by introducing a hybrid framework, localized regex repair (LRR), designed to harness LLM generalization while enforcing reliability. Our core insight is to decouple problem identification from the repair process. First, a deterministic, symbolic module localizes the precise vulnerable subpattern, creating a constrained and tractable problem space. Then, the LLM is invoked to generate a semantically equivalent fix for this isolated segment. This combined architecture successfully resolves complex repair cases intractable for rule-based repair while avoiding the semantic errors of LLM-only approaches. Our work provides a validated methodology for solving such problems in automated repair, improving the repair rate by 15.4 percentage points over the state-of-the-art. Our code is available at https://github.com/cdltlehf/LRR.
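The two-stage idea in the abstract (localize first, then ask the LLM to repair only the localized segment) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the nested-quantifier heuristic, the function names `localize` and `build_prompt`, and the prompt wording are hypothetical and are not taken from the paper or its repository. The paper's symbolic module is presumably far more precise than this toy detector, and no LLM is actually called.

```python
import re
from typing import Optional

# Hypothetical heuristic (NOT the paper's symbolic module): flag a
# parenthesized group that contains a quantifier and is itself quantified,
# e.g. (a+)+ or (\d+)*. Nested groups are deliberately not handled.
NESTED_QUANTIFIER = re.compile(
    r"\((?:[^()+*\\]|\\.)*"   # group body up to its first unescaped + or *
    r"[+*]"                   # the inner quantifier
    r"(?:[^()\\]|\\.)*\)"     # rest of the group body and the closing paren
    r"[+*]"                   # the outer quantifier applied to the group
)


def localize(pattern: str) -> Optional[str]:
    """Return the first sub-pattern the toy heuristic flags, or None."""
    match = NESTED_QUANTIFIER.search(pattern)
    return match.group(0) if match else None


def build_prompt(pattern: str, sub_pattern: str) -> str:
    """Build an instruction that points the model at the localized segment."""
    return (
        f"The regex {pattern!r} is vulnerable to ReDoS.\n"
        f"The vulnerable sub-pattern is {sub_pattern!r}.\n"
        "Rewrite only that sub-pattern so that the whole regex "
        "accepts exactly the same strings."
    )


if __name__ == "__main__":
    vulnerable = r"^(a+)+$"
    segment = localize(vulnerable)
    if segment is not None:
        print(build_prompt(vulnerable, segment))
```

The prompt string would then be sent to whichever model is in use, and the returned candidate would still need to be checked for both equivalence and invulnerability before being accepted.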

Paper Structure

This paper contains 29 sections, 4 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Performance comparison of the LRR framework against the baseline. The plot compares LRR implementations (yellow) against the rule-based baseline (blue), demonstrating successful improvement in repair rate (y-axis) across semantic similarity (x-axis).
  • Figure 2: An overview of our LRR framework. Given a vulnerable regex, the symbolic module first analyzes and localizes the vulnerability-inducing sub-pattern. This identified sub-pattern is then integrated into the LLMs' prompt, allowing the models to focus on the problematic segment and maximize the repair rate.
  • Figure 3: Box-and-whisker plots of the RLI (top) and NLS (bottom) of successfully repaired cases for each method. For the LRR method, we exclude all cases successfully repaired by the baseline RegexScalpel to isolate and highlight the unique contribution of LLMs.
  • Figure 4: Histograms illustrating the distribution of Jaccard Similarity scores for successfully repaired regexes.
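For readers unfamiliar with the metric named in Figure 4, the Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|. The sketch below assumes the sets are the strings from a fixed sample that each regex fully matches. That choice of sets, the sample strings, and the helper names `matched_set` and `jaccard` are illustrative assumptions, since this listing does not say how the paper computes the score.

```python
import re
from typing import Iterable, Set


def matched_set(pattern: str, samples: Iterable[str]) -> Set[str]:
    """Strings from `samples` that the pattern matches in full."""
    compiled = re.compile(pattern)
    return {s for s in samples if compiled.fullmatch(s)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """|A ∩ B| / |A ∪ B|, treating two empty sets as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Short, hypothetical sample inputs; kept tiny so the vulnerable
    # pattern cannot backtrack for long.
    samples = ["", "a", "aa", "ab", "b"]
    original = matched_set(r"(a+)+", samples)
    faithful = matched_set(r"a+", samples)   # accepts the same strings
    lossy = matched_set(r"a*", samples)      # additionally accepts ""
    print(jaccard(original, faithful), jaccard(original, lossy))
```

A score of 1 on a sample only shows that the two regexes agree on those particular strings, so it is evidence of similarity rather than a proof of equivalence.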